Abstract



This thesis describes the theoretical and practical foundations of a system
for the static analysis of XML processing languages. The system relies on
a fixpoint temporal logic with converse, derived from the μ-calculus, where
models are finite trees. This calculus is expressive enough to capture regular
tree types along with multi-directional navigation in trees, while having a
single exponential time complexity. Specifically, the logic is proved decidable
in time $2^{O(n)}$, where $n$ is the size of the input formula.

Major XML concepts are linearly translated into the logic: XPath naviga- 
tion and node selection semantics, and regular tree languages (which include 
DTDs and XML Schemas). Based on these embeddings, several problems of 
major importance in XML applications are reduced to satisfiability of the logic. 
These problems include XPath containment, emptiness, equivalence, overlap, 
coverage, in the presence or absence of regular tree type constraints, and the 
static type-checking of an annotated query. 

The focus is then given to a sound and complete algorithm for deciding the
logic, along with a detailed complexity analysis, and crucial implementation
techniques for building an effective solver. Practical experiments using a full
implementation of the system are presented. The system appears to be efficient
in practice for several realistic scenarios.

The main application of this work is a new class of static analyzers for
programming languages using both XPath expressions and XML type annotations
(input and output). Such analyzers make it possible to ensure valuable
properties at compile time, such as type-safety and optimizations, for safer
and more efficient XML processing.
Summary



This thesis presents the theoretical and practical foundations of a system for
the static analysis of languages that manipulate XML documents and data. The
system relies on a fixpoint temporal logic with converse programs, derived from
the modal μ-calculus, in which models are finite trees. This logic is
expressive enough to capture regular tree languages as well as
multi-directional navigation in trees, while having a single exponential
complexity. More precisely, the decidability of this logic is proved in time
$2^{O(n)}$, where $n$ is the size of the formula whose truth status is
determined.

The main concepts of XML are linearly translated into this logic. These
concepts include the navigation and node selection semantics of the XPath query
language, as well as schema languages (including DTD and XML Schema). Thanks to
these translations, problems of major importance in XML applications are
reduced to satisfiability of the logic. These problems notably include
containment, satisfiability, equivalence, overlap, and coverage of queries, in
the presence or absence of regular tree constraints, and the static typing of
an annotated query.

A sound and complete algorithm for deciding the logic is proposed, together
with a detailed analysis of its computational complexity and crucial
implementation techniques for building a solver that is efficient in practice.
Experiments with the full implementation of the system are presented. The
system appears efficient and usable in practice on several realistic scenarios.

The main application of this work is a new class of static analyzers for
programming languages that use XPath queries and regular tree types. Such
analyzers make it possible to ensure, at compile time, important properties
such as the correct typing of programs or their optimization, for safer and
more efficient processing of XML data.
This manuscript presents my research work done at the Institut National de Re-
Chapter 1

Introduction 



1.1 Motivation and Objectives 

This work was initially motivated by the need for efficient static type
checkers for XML processing languages. Such programming languages use schemas
[Fallside and Walmsley, 2004] and XPath [Clark and DeRose, 1999] queries as
first class language constructs. Current examples of these languages include
the W3C recommendation XSLT [Clark, 1999] for the transformation of XML
documents, and the forthcoming XQuery [Boag et al., 2006] recommendation
for querying XML databases. Providing such languages with decidable and
efficient static type systems has been one of the major research challenges
over the last decade, notably gathering the programming language, database
theory, structured documents, and theoretical computer science communities.
This work follows the research effort initiated in [Murata, 1996, Tozawa, 2001,
Milo et al., 2003, Hosoya and Pierce, 2003].

This work resulted in the design of a new logic of finite trees adapted for 
XML, and its decision procedure, presented in this dissertation. The logical 
solver has been implemented as the core of a system for the general static 
analysis and type-checking of XML specifications. The system can be used as 
a component of static analyzers for programming languages manipulating both 
XPath expressions and XML type annotations. 

This dissertation presents the theoretical investigations that led to the foun- 
dations of this new logic of finite trees, along with the algorithmic bases and 
implementation principles on which the logical solver relies. These discoveries 
are applied to the resolution of XML type-checking problems, which are em- 
bedded in the logic. Solved problems include static typing of XPath in the 
presence of regular tree type constraints. 

1.1.1 XML Documents and Schemas 

Extensible Markup Language (XML) [Bray et al., 2004] is a text file format for
representing tree structures in a standard form. 

The whole structure of an XML document, if we abstract over less important
details, is a tree of variable arity, in which nodes (also called elements in
the XML jargon) are labeled, leaves of the tree are text nodes, and the
ordering between children of a node is significant. XML can be seen as a
concrete syntax for describing such tree structures using mark-up texts. An
example of an XML document is as follows:

<plant>
  <category>Vascular</category>
  <tissue>
    <name>Phloem</name>
    <def>The phloem is a living tissue that carries organic
      nutrients to all parts of the plant where needed.</def>
    <note>In trees, the phloem is part of the bark.</note>
  </tissue>
</plant>

An element is described by a pair of an opening tag <...> and a closing
tag </...>, between which the element content is inserted. In the previous
example, "plant", "category", "tissue", "name", "def", and "note" are
labels (tag names in the XML jargon).

The XML specification does not fix a priori the set of allowed labels in an
XML document, nor does it define any semantics for labels. Only
well-formedness conditions are defined, in particular to ensure proper nesting
of elements, which allows XML documents to be considered as trees. For
instance, Figure 1.1 gives a more visual tree representation of the previous
well-formed sample XML document.




Figure 1.1: Sample Tree of a Well-Formed Document.

The set of labels occurring in an XML document is determined by schemas 
that can freely be defined by users. A schema (also called an XML type) is a 
description of constraints on the structure of documents such as allowed labels 
and their possible nesting structures. A schema thus defines a class of XML 
documents. Two levels of correctness can therefore be distinguished for XML 
documents: 

• well-formedness which applies to documents that obey the necessary and 
sufficient syntactic condition (defined by the XML specification) for being 
interpreted as trees; 
• validity which applies to documents that conform to the additional
constraints described by a given schema.

The validity of a document implies its well-formedness since the schema 
describes constraints on the tree and not on the text representation of the 
XML document. 
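
The first correctness level can be illustrated with a minimal sketch, assuming Python's standard `xml.etree.ElementTree` parser as the generic tool. The helper names `is_well_formed` and `labels`, the shortened sample text, and the deliberately mismatched `</nom>` tag are illustrative assumptions. Checking validity would additionally require a schema validator, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample plant document (text content abbreviated).
WELL_FORMED = """<plant>
  <category>Vascular</category>
  <tissue>
    <name>Phloem</name>
    <def>The phloem is a living tissue.</def>
    <note>In trees, the phloem is part of the bark.</note>
  </tissue>
</plant>"""

# Same text, but <name> is closed by a mismatched tag: improper nesting.
ILL_FORMED = WELL_FORMED.replace("</name>", "</nom>")


def is_well_formed(text):
    """Return True if `text` parses as XML, i.e. can be read as a tree."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def labels(text):
    """List element labels in document order (depth-first traversal)."""
    return [node.tag for node in ET.fromstring(text).iter()]
```

A well-formed document can thus be processed as a tree without reference to any schema, which is the genericity that well-formedness alone provides.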

Each application can define its own data format by defining schemas, at a
higher level of abstraction (tree structures). In that sense, XML is often
said to be a metalanguage or a "format for data formats".

Separating the two levels of correctness allows applications to share generic
software tools for manipulating well-formed XML documents (parsers, editors,
query and transformation tools...). These tools all implement the same
syntactic conventions defined by the XML specification (such as the way of
including comments, external fragments, special characters...). XML thus allows
a first level of processing on an XML document as soon as it is well-formed,
without making the additional and much stronger hypothesis that it is valid
w.r.t. some schema. This genericity is one of XML's strengths. As a
consequence, we have seen unprecedented speed and range in the adoption of XML.
A large number of schemas have been defined and are actually widely used in
practice, for instance: XHTML (the XML version of HTML), SVG (for vector
graphics), SMIL (for synchronized multimedia documents), MathML (for
mathematical formulas), SOAP (for remote procedure calls), XBRL (for financial
information), FIX (for securities transactions), SMD (for music), X3D (for
3D modeling) and CML (for chemical structures).

1.1.2 XPath 

XPath [Clark and DeRose, 1999, Berglund et al., 2006] has been introduced by
the W3C as the standard query language for addressing and retrieving
information in XML documents. It makes it possible to navigate in XML trees and
return a set of matching nodes. As such, XPath forms the essence of XML data
access.

In their simplest form, XPath expressions look like "directory navigation
paths". For example, the XPath expression

/book/chapter/section

navigates from the root of a document (designated by the leading slash "/")
through the top-level "book" nodes, to their "chapter" child nodes, and on to
their child nodes named "section". The result of the evaluation of the entire
expression is the set of all the "section" nodes that can be reached in this
manner. Furthermore, at each step in the navigation the selected nodes can be
filtered using qualifiers. A qualifier is a boolean expression between brackets
that can test the existence or absence of paths. So if we ask for

/book/chapter/section[citation]

then the result is all "section" elements that have at least one child element
named "citation". The situation becomes more interesting when combined with
XPath's capability of searching along "axes" other than the "children of" axis
shown so far. Indeed, the above XPath is a shorthand for

/child::book/child::chapter/child::section[child::citation]

where it is made explicit that each path step is meant to search the "child"
axis containing all children of the nodes selected at the previous step. If we
instead asked for

/child::book/descendant::*[child::citation]

then the last step selects nodes of any kind that are among the descendants
of the top element "book" and have a "citation" sub-element. One may also
use other axes such as "preceding-sibling" for navigating backward through
nodes of the same parent, or "ancestor" for navigating upward recursively (see
Figure 1.2). Document order is defined as the order in which a depth-first
tree traversal visits nodes. Axes that perform navigation in reverse document
order are called reverse axes (or alternatively backward or upward axes in the
literature).
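
The three forward-axis queries above can be tried out with a minimal sketch, assuming Python's standard `xml.etree.ElementTree`, which implements only a limited XPath subset (child steps, `//`, and simple qualifiers, but not reverse axes such as "ancestor" or "preceding-sibling"). The sample document, its `id` attributes, and the helper `ids` are illustrative assumptions. Because ElementTree evaluates paths relative to an element, each absolute expression is rewritten relative to the top-level "book" element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the id attributes only serve to name the results.
DOC = """<book>
  <chapter>
    <section id="s1"><citation/></section>
    <section id="s2"/>
  </chapter>
  <appendix id="a1"><citation/></appendix>
</book>"""

book = ET.fromstring(DOC)  # the top-level "book" element


def ids(nodes):
    """Return the id attribute of each selected node, in document order."""
    return [node.get("id") for node in nodes]


# /book/chapter/section
sections = book.findall("./chapter/section")

# /book/chapter/section[citation]
cited_sections = book.findall("./chapter/section[citation]")

# /child::book/descendant::*[child::citation]
cited_descendants = book.findall(".//*[citation]")
```

Here `ids(sections)` lists both sections, `ids(cited_sections)` keeps only the section with a "citation" child, and `ids(cited_descendants)` additionally picks up the appendix, since the descendant step is not restricted to "section" nodes.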

The previous examples are absolute XPath expressions, as they start with a "/"
which refers to the root. The meaning of a relative expression (without the
leading "/") is defined with respect to a context node in the tree. The context
node simply refers to the tree node from which navigation starts. Starting from
a particular context node in a tree, every other node can easily be reached:
XPath axes define a partitioning of a tree from any context node. Figure 1.2
illustrates this on a sample tree. More informal details on the complete XPath
standard can be found in the W3C specification [Clark and DeRose, 1999].

XPath is increasingly popular due to its expressive power and its compact
syntax. These two advantages have given XPath a central role both in
other key XML specifications and XML applications. It is used in XQuery
[Boag et al., 2006] as a core query language; in XSLT [Clark, 1999] as node
selector in the transformations; in XML Schema [Fallside and Walmsley, 2004] to
define keys; in XLink [DeRose et al., 2001] and XPointer [DeRose et al., 2002]
to reference portions of XML data. XPath is also used in many applications
such as update languages [Sur et al., 2004] and access control [Fan et al., 2004].



1.1.3 Static Type-Checking 

XML applications most commonly use schemas for performing validation (also
called dynamic type-checking). Validation consists in using a schema validator
that analyzes a particular XML document w.r.t. a given schema in order to
ensure that the document actually conforms to the expectations of the
application.

In practice however XML documents are often generated dynamically by 
some program. Typically, programs that manipulate XML first access data 
(possibly conforming to an available schema) using XPath expressions, and 
then build and return an output XML document intended to conform to a 
given schema. 

An ambitious approach is the static type-checking of these programs, which
consists in ensuring at compile-time that invalid documents can never arise as
outputs of XML processing code. A static type checker analyzes a program,
possibly in conjunction with schemas that describe its input and output
(depending on whether such schemas are available). The problem's difficulty is
a function of the language in which the program and the schemas are expressed.

Schema languages have been extensively studied and are now well understood as
subsets of regular tree languages [Murata et al., 2005]. However, although
many attempts have been made to better understand static type-checking
techniques, in particular through the design of domain-specific languages
[Hosoya and Pierce, 2003], no approach is effectively able to deal with
XPath, which nevertheless remains the essence of XML navigation and data
access.



1.1.4 Research Challenges 

The reason for the limitations of existing approaches is the difficulty of
XPath static analysis. It is known that the static analysis of the complete
XPath standard is undecidable. Its importance and range of applications
nevertheless motivate research questions: what is the largest XPath fragment
with decidable static analysis? Which fragments can be effectively decided in
practice? How to determine if an XPath expression is satisfiable on any of the
XML trees defined by a given schema? How to know if two XPath queries will
always yield the same result when evaluated on a document valid w.r.t. a given
schema? Does the result of an XPath expression over a valid document always
conform to another schema? Is there an algorithm able to answer these
questions in an efficient way so that it can be used in practice?

One source of difficulty for such an algorithm is that it needs to check
properties quantified over a possibly infinite set of trees. A variety
of factors furthermore contribute to its complexity, such as the operators
allowed in XPath queries and the combination of them (cf. Section 2.2). A
consequence of these difficulties is that such research questions are still
open.
1.2 Overview of this Dissertation

This dissertation starts from the idea that for deciding XML problems, two
issues must be addressed. First, identify an appropriate logic with sufficient
expressiveness to capture both regular tree types and XPath style navigation
and node selection semantics. Second, efficiently solve the satisfiability
problem, which consists in testing whether a given formula of the logic admits
a satisfying XML document as a model.

1.2.1 Applications 

The main application of this work is the static analysis of programs manip- 
ulating XML data and documents. This dissertation provides the necessary 
foundations and system implementations for solving the major XML decision 
problems that naturally arise from such static analyses. 

The most basic decision problem for a query language is the emptiness
check [Benedikt et al., 2005]: whether or not an expression yields a non-empty
result. XPath emptiness is important for optimization of host language
implementations: for instance, if one can decide at compile time that a query
is not satisfiable then subsequent bound computations can be avoided.

Another basic decision problem is the XPath equivalence problem: whether 
or not two queries always return the same result. It is important for reformula- 
tion and optimization of the query itself [Geneves and Vion-Dury, 2004], which 
aim at enforcing operational properties while preserving semantic equivalence 
[Abiteboul and Vianu, 1999, Levin and Pierce, 2005]. 

The most critical problem for the type-checking of XML transformations 
is XPath containment: whether or not, for any tree, the result of a particular 
query is included in the result of another one. It is required for the control-flow 
analysis of XSLT [Møller et al., 2005]. It is also needed for checking integrity
constraints [Fallside and Walmsley, 2004], and for checking access control in 
XML security applications [Fan et al., 2004]. 

Other decision problems needed in applications include for example XPath 
overlap (whether two expressions select common nodes) and coverage (whether 
nodes selected by an expression are always contained in the union of the results 
selected by several other expressions). 
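
The decision problems listed above can be stated compactly. In the following sketch, the notation is an assumption introduced for illustration: $[\![e]\!](t)$ denotes the set of nodes selected by the XPath expression $e$ in the tree $t$, and $t$ ranges over all finite trees (or over the trees of a given type when type constraints are present).

```latex
\begin{array}{ll}
\text{emptiness of } e:
  & \forall t.\ [\![e]\!](t) = \emptyset \\
\text{equivalence of } e_1 \text{ and } e_2:
  & \forall t.\ [\![e_1]\!](t) = [\![e_2]\!](t) \\
\text{containment of } e_1 \text{ in } e_2:
  & \forall t.\ [\![e_1]\!](t) \subseteq [\![e_2]\!](t) \\
\text{overlap of } e_1 \text{ and } e_2:
  & \exists t.\ [\![e_1]\!](t) \cap [\![e_2]\!](t) \neq \emptyset \\
\text{coverage of } e \text{ by } e_1, \ldots, e_n:
  & \forall t.\ [\![e]\!](t) \subseteq \bigcup_{1 \leq i \leq n} [\![e_i]\!](t)
\end{array}
```

Each universally quantified statement holds exactly when no counterexample tree exists. For instance, containment of $e_1$ in $e_2$ fails if and only if some tree has a node selected by $e_1$ but not by $e_2$, which is why such problems lend themselves to a satisfiability test.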

This dissertation effectively solves these problems in the presence, or
absence, of XML type constraints such as DTDs [Bray et al., 2004] or XML
Schemas [Fallside and Walmsley, 2004]. This makes it possible to ensure
valuable properties (such as type-safety and optimizations) at compile-time,
toward safer and more efficient runtime XML processing. Results presented in
this dissertation thus notably open promising perspectives for the effective
static analysis of XML transformations.

1.2.2 Outline 

The first part of this dissertation is dedicated to state-of-the-art related
tools and techniques. Chapter 2 introduces some known theoretical foundations
and formalisms used in the remainder of this dissertation, while progressively
introducing related work.
In a second part, Chapter 3 and Chapter 4 conduct preliminary investigations
with known logics in the context of XML. Specifically, Chapter 3 studies
to what extent monadic second order logic can be used in practice, despite
its high complexity, for solving XML static analysis problems such as XPath
containment. Chapter 4 introduces the μ-calculus as a powerful replacement
for monadic second order logic, and studies its use for XML reasoning.

Based on the lessons learned from these investigations, the third part of this 
dissertation presents the final contribution. Chapter 5 proposes a logic of finite 
trees specifically designed for XML. Chapter 6 describes a proposed algorithm 
for testing the satisfiability of the logic, along with implementation techniques. 
Finally, Chapter 7 concludes this dissertation and gives several perspectives. 
Chapter 2

Foundations of XML Processing 



In this chapter, some known theoretical foundations and formalisms used in the
following chapters of this dissertation are introduced. State-of-the-art
related work is presented as underlying concepts are progressively introduced.

2.1 Trees and Tree Types 

This section introduces the formal models of XML documents and schemas
most often considered in the literature as well as in Chapters 2, 3, and 4 of
this dissertation¹.

2.1.1 Finite Trees and Hedges 

An XML document can be seen as a finite ordered and labeled tree of
unbounded depth and arity. Since there is no a priori bound on the number
of children of a node, such a tree is unranked [Neven, 2002b]. Tree
nodes are labeled with symbols taken from a countably infinite alphabet
$\Sigma$. There is a straightforward isomorphism between sequences of unranked
trees and binary trees [Hosoya and Pierce, 2003, Neven, 2002b]. In order to
describe it, trees are first formally defined. An unranked tree is defined as
$\sigma(h)$ where $\sigma \in \Sigma$ and $h$ is a hedge, i.e. a sequence of
unranked trees, defined as follows:

\[
\begin{array}{rcll}
\mathcal{H}_\Sigma \ni h & ::= & & \text{hedge} \\
 & & \sigma(h), h' & \text{non-empty sequence of trees} \\
 & \mid & () & \text{empty sequence}
\end{array}
\]

The set of unranked trees is denoted by $\mathcal{T}_\Sigma$. A binary tree $t$
is either a $\sigma$-labeled root of two subtrees ($\sigma \in \Sigma$) or the
empty tree:

\[
\begin{array}{rcll}
t & ::= & & \text{binary tree} \\
 & & \sigma(t, t') & \text{node} \\
 & \mid & \epsilon & \text{empty tree}
\end{array}
\]


¹ Chapter 5 elaborates further on this model by introducing focused trees.
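Unranked trees are translated into binary trees with the following function
$\beta(\cdot)$, which maps a hedge to a binary tree:

\[
\begin{array}{rcl}
\beta(\sigma(h), h') & \stackrel{\text{def}}{=} & \sigma(\beta(h), \beta(h')) \\
\beta(()) & \stackrel{\text{def}}{=} & \epsilon
\end{array}
\]

The inverse translation function $\beta^{-1}(\cdot)$ converts a binary tree
into a sequence of unranked trees:

\[
\begin{array}{rcl}
\beta^{-1}(\sigma(t, t')) & \stackrel{\text{def}}{=} & \sigma(\beta^{-1}(t)), \beta^{-1}(t') \\
\beta^{-1}(\epsilon) & \stackrel{\text{def}}{=} & ()
\end{array}
\]

For example, Figure 2.1 illustrates how the sample tree $a(b, c, d)$ is mapped
to its binary representation $a(b(\epsilon, c(\epsilon, d(\epsilon, \epsilon))), \epsilon)$
and vice versa.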




Figure 2.1: Unranked and Binary Tree Representations. 

Note that the translation of a single unranked tree results in a binary tree
of the form $\sigma(t, \epsilon)$. Conversely, the inverse translation of such
a binary tree always yields a single unranked tree. When modeling XML, it is
therefore possible to focus on binary trees of the form $\sigma(t, \epsilon)$,
without loss of generality. The following section presents how this
isomorphism between binary and unranked trees also extends to tree types. Such
binary mappings simplify the formal notations used in the remainder of this
dissertation.
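
The translation between hedges and binary trees can be sketched in a few lines. The encoding below is an assumption chosen for illustration: an unranked tree is a pair `(label, children)`, a hedge is a Python list of such trees, a binary tree is a triple `(label, left, right)`, and `None` stands for the empty tree. The function names `beta` and `beta_inv` are likewise illustrative.

```python
def beta(hedge):
    """Translate a hedge (list of unranked trees) into a binary tree.

    The left subtree encodes the children of the first tree of the hedge,
    and the right subtree encodes its following siblings.
    """
    if not hedge:
        return None  # the empty sequence maps to the empty tree
    (label, children), rest = hedge[0], hedge[1:]
    return (label, beta(children), beta(rest))


def beta_inv(tree):
    """Translate a binary tree back into a hedge (inverse of beta)."""
    if tree is None:
        return []  # the empty tree maps to the empty sequence
    label, left, right = tree
    return [(label, beta_inv(left))] + beta_inv(right)


# The sample tree a(b, c, d), as a one-element hedge.
SAMPLE = [("a", [("b", []), ("c", []), ("d", [])])]
```

Applying `beta` to `SAMPLE` yields the nested triple for a(b(ε, c(ε, d(ε, ε))), ε), whose right component is `None` because a single unranked tree has no sibling, and `beta_inv` recovers the original hedge.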

2.1.2 Schema Languages and Regular Tree Types 

Schemas describe structural constraints for XML documents. There are many
formalisms (called schema languages) for specifying schemas (or "types"). For
instance: DTD, which is part of the XML specification [Bray et al., 2004],
XML Schema (W3C) [Fallside and Walmsley, 2004], and RELAX NG (OASIS/ISO)
[Clark and Murata, 2001] are actively used by various applications.
Each schema language has different constraint mechanisms and a different
expressiveness. A detailed characterization of each schema language can be
found in [Murata et al., 2005]. No current schema language goes beyond the
expressive power of regular tree languages. From an XML point of view,
regular tree types form a strict superset of standards such as XML Schemas and
DTDs (cf. Figure 2.2). Therefore, in this dissertation, regular tree languages
are considered as the general mechanism for typing XML documents.
Figure 2.2: Relative Expressiveness of Schema Languages.

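A tree type expression $T$ is syntactically defined as follows:

\[
\begin{array}{rcll}
\mathcal{L}_{cft} \ni T & ::= & & \text{context-free tree type expression} \\
 & & \emptyset & \text{empty set of trees} \\
 & \mid & () & \text{empty sequence} \\
 & \mid & X & \text{variable} \\
 & \mid & l[T] & \text{label} \\
 & \mid & T_1, T_2 & \text{sequence} \\
 & \mid & T_1 \mid T_2 & \text{disjunction} \\
 & \mid & \text{let } \overline{X_i.T_i} \text{ in } T & \text{n-ary binder}
\end{array}
\]

where $l \in \Sigma$ and $X \in TVar$, assuming that $TVar$ is a countably
infinite set of type variables. Abbreviated type expressions can be defined as
follows:

\[
\begin{array}{rcl}
T? & \stackrel{\text{def}}{=} & () \mid T \\
T* & \stackrel{\text{def}}{=} & \text{let } X.(T, X \mid ()) \text{ in } X \\
T+ & \stackrel{\text{def}}{=} & T, T*
\end{array}
\]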

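Given an environment $\theta$ of type variable bindings, the semantics of tree
types is given by the denotation function $[\![\cdot]\!]_\theta$:

\[
\begin{array}{rcl}
[\![\cdot]\!]_\theta & : & \mathcal{L}_{cft} \times (TVar \to 2^{\mathcal{H}_\Sigma}) \to 2^{\mathcal{H}_\Sigma} \\
[\![\emptyset]\!]_\theta & = & \emptyset \\
[\![()]\!]_\theta & = & \{()\} \\
[\![X]\!]_\theta & = & \theta(X) \\
[\![l[T]]\!]_\theta & = & \{ l'(h) \mid l' \preceq l \wedge h \in [\![T]\!]_\theta \} \\
[\![T_1, T_2]\!]_\theta & = & \{ h_1, h_2 \mid h_1 \in [\![T_1]\!]_\theta \wedge h_2 \in [\![T_2]\!]_\theta \} \\
[\![T_1 \mid T_2]\!]_\theta & = & [\![T_1]\!]_\theta \cup [\![T_2]\!]_\theta \\
[\![\text{let } \overline{X_i.T_i} \text{ in } T]\!]_\theta & = & [\![T]\!]_{\mathrm{lfp}(S)}
\end{array}
\]

where $\preceq$ is a global subtagging relation: a reflexive and transitive
relation on labels², and $S(\theta') = \theta[X_i \mapsto [\![T_i]\!]_{\theta'}]_{i \geq 1}$.
Note that each function $S$ is monotone according to the ordering $\subseteq$
on $TVar \to 2^{\mathcal{H}_\Sigma}$ and thus has a least fixpoint
$\mathrm{lfp}(S)$.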

Types as defined above actually correspond to arbitrary context-free tree
types, for which the decision problem for inclusion is known to be
undecidable [Hopcroft et al., 2000]. An additional restriction is imposed to
reduce the expressive power of considered types so that they correspond to
regular tree languages. The restriction (also used in [Hosoya et al., 2005b])
consists in a simple syntactic condition that allows unguarded (i.e. not
enclosed by a label) recursive uses of variables, but restricts them to tail
positions³. This condition ensures regularity, and the resulting class of
regular tree languages is denoted $\mathcal{L}_{rt}$.



Document Type Definitions 



This subsection further details the connection between regular tree types and
the widely used DTD standard. As they are defined in the W3C recommendation,
DTDs [Bray et al., 2004] are local tree grammars⁴, which are strictly less
expressive than regular tree types. In the XML terminology, a type expression
is called the content model. DTD content models are described by the following
syntax:

\[
\begin{array}{rcll}
T & ::= & & \text{DTD tree type expression} \\
 & & l & \text{label} \\
 & \mid & T_1 \mid T_2 & \text{disjunction} \\
 & \mid & T_1, T_2 & \text{sequence} \\
 & \mid & T? & \text{optional occurrence} \\
 & \mid & T* & \text{zero, one or more occurrences} \\
 & \mid & T+ & \text{one or more occurrences} \\
 & \mid & () & \text{empty sequence}
\end{array}
\]

where $l \in \Sigma$. From the W3C specification, a DTD can be seen as a
function that associates a content model to each label taken from a subset
$\Sigma'$ of $\Sigma$, such that $\Sigma'$ gathers all labels used in content
models. The set $\mathcal{L}_{dtd}$ of tree types



² Subtagging goes beyond the expressive power of DTDs but a similar notion
called "substitution groups" exists in XML Schemas (see [Hosoya et al., 2005b]
for more details on subtagging).

³ For instance the type "let X.a[],Y ; Y.b[],X | () in X" is allowed.

⁴ A local tree grammar is a regular tree grammar without competing
non-terminals. Two non-terminals A and B of a tree grammar are said to compete
with each other if one production rule has A in its left-hand side, one
production rule has B in its left-hand side, and these two rules share the
same terminal symbol in the right-hand side.
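described by DTDs can thus be represented as follows:

\[
\begin{array}{rcll}
\mathcal{L}_{dtd} \ni T & ::= & & \text{DTD tree type expression} \\
 & & l & \text{label} \\
 & \mid & T_1 \mid T_2 & \text{disjunction} \\
 & \mid & T_1, T_2 & \text{sequence} \\
 & \mid & T? & \text{optional occurrence} \\
 & \mid & T* & \text{zero, one or more occurrences} \\
 & \mid & T+ & \text{one or more occurrences} \\
 & \mid & () & \text{empty sequence} \\
 & \mid & \text{let } \overline{l_i.T_i} \text{ in } T & \text{n-ary binder}
\end{array}
\]

Note that $\mathcal{L}_{dtd} \subseteq \mathcal{L}_{rt}$ follows directly, by
associating a unique type variable with each label. In the following, DTDs are
therefore not distinguished from general regular tree types anymore.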
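
The view of a DTD as a function from labels to content models can be sketched as follows. This is a minimal illustration under stated assumptions, not a DTD validator: content models are hand-written as Python regular expressions over child labels (each label followed by a space), `(#PCDATA)` is approximated by "no child elements", text content is ignored, and the names `CONTENT_MODELS` and `is_valid` are hypothetical. The models mirror a small plant DTD with elements plant, category, tissue, name, def, note, and phylogeny.

```python
import re
import xml.etree.ElementTree as ET

# One content model per label: a regular expression over the sequence of
# child labels. The empty pattern means that no child element is allowed.
CONTENT_MODELS = {
    "plant": r"(category )?(tissue )*(phylogeny )?",
    "category": r"",
    "tissue": r"(name )+(def )(note )?",
    "name": r"",
    "def": r"",
    "note": r"",
    "phylogeny": r"(plant )+",
}


def is_valid(node, models=CONTENT_MODELS):
    """Check that every element's children match its label's content model."""
    model = models.get(node.tag)
    if model is None:
        return False  # label not declared by the DTD
    child_labels = "".join(child.tag + " " for child in node)
    if re.fullmatch(model, child_labels) is None:
        return False
    return all(is_valid(child, models) for child in node)


VALID = ET.fromstring(
    "<plant><category>Vascular</category>"
    "<tissue><name>Phloem</name><def>...</def><note>...</note></tissue>"
    "</plant>"
)
# Same document without the mandatory <def> child of <tissue>.
INVALID = ET.fromstring(
    "<plant><tissue><name>Phloem</name><note>...</note></tissue></plant>"
)
```

Because the content model depends only on the label of the element, with no competing non-terminals, a single dictionary lookup per node suffices, which reflects the locality that makes DTDs strictly less expressive than regular tree types.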



2.1.3 Binary Tree Types 

Section 2.1.1 presented a straightforward isomorphism between binary trees 
and sequences of unranked trees. There is also an isomorphism between un- 
ranked and binary tree types, which follows exactly the same intuition as for 
trees. 

Binary tree types are described by the following syntax:

\[
\begin{array}{rcll}
\mathcal{L}_{bt} \ni T & ::= & & \text{binary tree type expression} \\
 & & \emptyset & \text{empty set of trees} \\
 & \mid & \epsilon & \text{empty sequence} \\
 & \mid & T_1 \mid T_2 & \text{disjunction} \\
 & \mid & l(X_1, X_2) & \text{label} \\
 & \mid & \text{let } \overline{X_i.T_i} \text{ in } T & \text{n-ary binder}
\end{array}
\]

For any type, there is an equivalent binary type, and vice versa. The
translation function $B(\cdot)$ shown in Figure 2.3 (and adapted from the one
found in [Hosoya et al., 2005b]) is used to convert a type into its
corresponding binary representation. The function uses the environment
$\theta$, defined on $TVar$, for accessing the type bound to a variable $X_i$
by constructs of the form "let $\overline{X_i.T_i}$ in $T$".

For example, Figure 2.4 gives a sample DTD that validates the well-formed
XML document presented in Section 1.1.1 of Chapter 1. The corresponding
context-free tree type expression is presented in Figure 2.5. It uses 14 type
variables (preceded by a dollar sign $ by convention). Figure 2.6 shows its
translation into binary tree type syntax.



2.1.4 Finite Tree Automata 

Tree automata are a convenient operational formalism for expressing the notion
of tree languages. A language is recognizable if there exists an automaton
which recognizes trees of the language. A detailed classification of tree
automata and associated results on the recognizability of tree languages are
presented in [Comon et al., 1997]. This section presents the most basic
results on finite tree automata needed for the remainder of this dissertation.
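\[
\begin{array}{rcl}
B(\cdot) & : & \mathcal{L}_{rt} \to \mathcal{L}_{bt} \\
B(\emptyset) & \stackrel{\text{def}}{=} & \emptyset \\
B(()) & \stackrel{\text{def}}{=} & \epsilon \\
B(l[T]) & \stackrel{\text{def}}{=} & \text{let } X_1.B(T), X_2.\epsilon \text{ in } l(X_1, X_2) \\
B(T_1 \mid T_2) & \stackrel{\text{def}}{=} & B(T_1) \mid B(T_2) \\
B(\text{let } \overline{X_i.T_i} \text{ in } T) & \stackrel{\text{def}}{=} & \text{let } \overline{X_i.B(T_i)} \text{ in } B(T) \\
B(\emptyset, T) & \stackrel{\text{def}}{=} & \emptyset \\
B((), T) & \stackrel{\text{def}}{=} & B(T) \\
B(X, T) & \stackrel{\text{def}}{=} & B(\theta(X), T) \\
B(l[T_1], T_2) & \stackrel{\text{def}}{=} & \text{let } X_1.B(T_1), X_2.B(T_2) \text{ in } l(X_1, X_2) \\
B((T_1 \mid T_2), T_3) & \stackrel{\text{def}}{=} & B(T_1, T_3) \mid B(T_2, T_3) \\
B((T_1, T_2), T_3) & \stackrel{\text{def}}{=} & B(T_1, (T_2, T_3)) \\
B(\text{let } \overline{X_i.T_i} \text{ in } T, T') & \stackrel{\text{def}}{=} & \text{let } \overline{X_i.B(T_i)} \text{ in } B(T, T')
\end{array}
\]

Figure 2.3: Binarization of Tree Types.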



<!ELEMENT plant (category?, tissue*, phylogeny?)>
<!ELEMENT category (#PCDATA)>
<!ELEMENT tissue (name+, def, note?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT def (#PCDATA)>
<!ELEMENT note (#PCDATA)>
<!ELEMENT phylogeny (plant+)>

Figure 2.4: A Sample DTD.



Bottom-Up Finite Tree Automata Formally, a bottom-up non-deterministic
finite tree automaton (NFTA) over an alphabet $\Sigma$ of node labels is a
tuple $(Q, Q_f, \Gamma)$ where $Q$ is the set of states, $Q_f \subseteq Q$ is a
set of accepting states, and $\Gamma$ is a set of transitions. Transitions are
either of the form $q \leftarrow \sigma$ or of the form
$q'' \leftarrow \sigma(q, q')$, depending on the arity of the symbol
$\sigma \in \Sigma$ (respectively a leaf or a binary constructor), where
$q, q', q''$ are automaton states belonging to $Q$. A bottom-up NFTA starts
from the leaves and moves up the tree. At each step of the execution, a state
is inductively associated with each subtree. The tree is accepted if the state
labeled at the root is an accepting state.
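
The bottom-up execution just described can be sketched as follows. The encoding and the example automaton are assumptions made for illustration: a binary tree is a triple `(label, left, right)` with `None` for the empty tree, `LEAF` holds the states assigned to the empty tree, and `RULES` maps `(label, q, q')` to the set of possible states `q''`. The example automaton, over the labels "a" and "b", non-deterministically guesses a "b" node and accepts exactly the trees that contain one.

```python
from itertools import product

# States: "q" (no commitment) and "qb" (a "b" node was found in this subtree).
LEAF = {"q"}      # states assigned to the empty tree
FINAL = {"qb"}    # accepting states
RULES = {}
for label in ("a", "b"):
    RULES[(label, "q", "q")] = {"q"}
    RULES[(label, "qb", "q")] = {"qb"}
    RULES[(label, "q", "qb")] = {"qb"}
# Non-determinism: at a "b" node the automaton may also move to "qb".
RULES[("b", "q", "q")] = {"q", "qb"}


def states(tree, leaf=LEAF, rules=RULES):
    """Return every state the automaton can associate with the root of `tree`."""
    if tree is None:
        return set(leaf)
    label, left, right = tree
    reachable = set()
    for q, q2 in product(states(left, leaf, rules), states(right, leaf, rules)):
        reachable |= rules.get((label, q, q2), set())
    return reachable


def accepts(tree, final=FINAL):
    """A tree is accepted if some run labels its root with an accepting state."""
    return bool(states(tree) & final)


WITH_B = ("a", ("b", None, None), None)
WITHOUT_B = ("a", ("a", None, None), None)
```

Computing the set of all reachable states at each node, rather than a single state, is one simple way to simulate every non-deterministic run at once.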
$Empty -> EMPTYSET
$Epsilon ->
$Any ->
$PCData ->
$name -> name($PCData)
$note -> note($PCData)
$1 -> $plant | $plant, $1
$phylogeny -> phylogeny($1)
$category -> category($PCData)
$def -> def($PCData)
$2 -> $name | $name, $2
$tissue -> tissue($2, $def, () | $note)
$3 -> () | $tissue, $3
$plant -> plant(() | $category, $3, () | $phylogeny)

Start symbol is $plant
14 type variables.
7 terminals.

Figure 2.5: Sample Context-Free Tree Type Expression.



$2 -> plant($1, $Epsilon) | plant($1, $2)
$7 -> EPSILON | note($Epsilon, $Epsilon)
$5 -> def($Epsilon, $7)
$3 -> name($Epsilon, $5) | name($Epsilon, $3)
$10 -> EPSILON | phylogeny($2, $Epsilon) | tissue($3, $10)
$1 -> EPSILON | phylogeny($2, $Epsilon) |
      tissue($3, $10) | category($Epsilon, $10)
$plant -> plant($1, $Epsilon)

Start symbol is $plant
7 type variables.
7 terminals.

Figure 2.6: Sample Binary Tree Type Expression.
Top-Down Finite Tree Automata There exists a symmetric counterpart
of bottom-up NFTA called top-down NFTA, which corresponds to the opposite
direction for recognizing a tree. A top-down NFTA $(Q, Q_i, \Gamma)$ starts at
the root and moves down to the leaves. Based on a state and a current node in
the tree, a new state is inductively associated with each subtree. Transitions
thus have the reverse form, and $Q_i$ is the set of initial states. The tree is
accepted if every branch can be traversed this way.

Determinism A deterministic finite tree automaton (DFTA) is one where 
no two transition rules have the same left-hand side. This definition matches 
the intuitive idea that for an automaton to be deterministic, one and only one 
transition must be possible for a given node. 

Expressive Power Top-down and bottom-up NFTA are equivalent (the 
transition rules are simply reversed, and the final states become the initial 
states). However, top-down DFTA are strictly less powerful than their 
deterministic bottom-up counterparts. This is because transition rules of tree 
automata can be seen as rewrite rules, and for top-down automata the left-hand 
sides correspond to parent nodes. Consequently, a deterministic top-down tree 
automaton can only test tree properties that hold in all branches, because the 
state written into each child branch is chosen at the parent node, without 
knowing the contents of the child branches. 

Every bottom-up NFTA is equivalent to a bottom-up DFTA, which can 
be obtained by determinization. Determinization relies on the 
"subset construction", and the number of states of the equivalent DFTA can be 
exponential in the number of states of the given NFTA (see [Comon et al., 1997] 
for the detailed algorithm). In the bottom-up paradigm, since NFTA and 
DFTA accept the same class of tree languages, they are usually not distinguished 
and are both simply referred to as finite tree automata (FTA). 

FTA are equivalent to regular tree types and therefore have the same 
expressiveness. 
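
The subset construction mentioned above can be sketched as follows. This is a naive illustration, not the algorithm of [Comon et al., 1997]: the encoding of automata as dictionaries and the name `determinize` are assumptions made for the example. Each state of the resulting DFTA is a set of NFTA states, and the empty set acts as a sink state, so the result is also complete over the labels that occur in the transitions.

```python
from itertools import product

def determinize(leaf_rules, node_rules, accepting):
    """Naive subset construction for a bottom-up NFTA over binary trees.

    leaf_rules: {leaf label: set of states}
    node_rules: {(label, q, q'): set of states}
    Returns (d_leaf, d_node, d_accepting) where states are frozensets.
    """
    d_leaf = {a: frozenset(qs) for a, qs in leaf_rules.items()}
    states = set(d_leaf.values())
    labels = {a for (a, _, _) in node_rules}
    d_node = {}
    changed = True
    while changed:
        changed = False
        # product() snapshots its arguments, so `states` may grow in the loop.
        for a, s1, s2 in product(labels, list(states), list(states)):
            if (a, s1, s2) in d_node:
                continue
            target = frozenset(q for q1 in s1 for q2 in s2
                               for q in node_rules.get((a, q1, q2), ()))
            d_node[(a, s1, s2)] = target
            if target not in states:
                states.add(target)
                changed = True
    d_accepting = {s for s in states if s & set(accepting)}
    return d_leaf, d_node, d_accepting

# Hypothetical non-deterministic example: a(q0, q0) may yield q0 or q1.
D_LEAF, D_NODE, D_ACC = determinize(
    {"e": {"q0"}}, {("a", "q0", "q0"): {"q0", "q1"}}, {"q1"})
```

In the worst case the loop explores every subset of the original state set, which is the source of the exponential blow-up discussed above.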

FTA as XML Types Murata was the first to consider tree automata as a 
schema definition language [Murata, 1998]. Since then, FTA have been heavily 
used in many research works for modeling XML types [Neven, 2002a]. In fact, the 
schema language Relax NG [Clark and Murata, 2001], a competitor of XML 
Schema [Fallside and Walmsley, 2004] (itself introduced as a replacement for 
DTDs [Bray et al., 2004]), is even directly inspired by FTA. A detailed 
comparison of these schema languages based on formal language theory is provided 
in [Murata et al., 2005]. 

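As a simple example, Figure 2.7 illustrates a sample NFTA which accepts 
the set of trees defined by the DTD shown in Figure 2.4. The NFTA accepts 
the set of all binary trees $\beta(t)$ such that the unranked tree $t$ is validated by the 
DTD of Figure 2.4. Note that the NFTA can be seen as another notation for 
the binary tree type expression shown in Figure 2.6. More interestingly, the 
DFTA obtained by determinization of this NFTA can be seen as the operational 
validator of the DTD. 

\[
\begin{array}{rcl}
Q &=& \{q_1, q_2, q_3, q_5, q_7, q_{10}, q_\epsilon, q_{plant}\} \\
Q_f &=& \{q_{plant}\} \\
\Gamma &=& \{\; q_2 \leftarrow plant(q_1, q_\epsilon), \\
 & & \;\; q_2 \leftarrow plant(q_1, q_2), \\
 & & \;\; q_7 \leftarrow \epsilon, \\
 & & \;\; q_7 \leftarrow note(q_\epsilon, q_\epsilon), \\
 & & \;\; q_5 \leftarrow def(q_\epsilon, q_7), \\
 & & \;\; q_3 \leftarrow name(q_\epsilon, q_5), \\
 & & \;\; q_3 \leftarrow name(q_\epsilon, q_3), \\
 & & \;\; q_{10} \leftarrow \epsilon, \\
 & & \;\; q_{10} \leftarrow phylogeny(q_2, q_\epsilon), \\
 & & \;\; q_{10} \leftarrow tissue(q_3, q_{10}), \\
 & & \;\; q_1 \leftarrow \epsilon, \\
 & & \;\; q_1 \leftarrow phylogeny(q_2, q_\epsilon), \\
 & & \;\; q_1 \leftarrow tissue(q_3, q_{10}), \\
 & & \;\; q_1 \leftarrow category(q_\epsilon, q_{10}), \\
 & & \;\; q_{plant} \leftarrow plant(q_1, q_\epsilon) \;\}
\end{array}
\]

Figure 2.7: A Sample NFTA $(Q, Q_f, \Gamma)$. 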



Closure Properties One of the main advantages of FTA (compared to 
DTDs for instance) is their closure under set-theoretic operations such as union, 
intersection, and complementation [Comon et al., 1997]. 

The union of two tree automata is trivially built: let $A_1 = (Q_1, Q_{f_1}, \Gamma_1)$ and 
$A_2 = (Q_2, Q_{f_2}, \Gamma_2)$ be two FTA. Since states of an FTA may be renamed without 
loss of generality, it is assumed that $Q_1 \cap Q_2 = \emptyset$. It is then straightforward 
to verify that $A_1 \cup A_2 = (Q, Q_f, \Gamma)$ is defined by: $Q = Q_1 \cup Q_2$, 
$Q_f = Q_{f_1} \cup Q_{f_2}$ and $\Gamma = \Gamma_1 \cup \Gamma_2$. 

Similarly, the intersection of two tree automata $A_1 = (Q_1, Q_{f_1}, \Gamma_1)$ and 
$A_2 = (Q_2, Q_{f_2}, \Gamma_2)$ is simply obtained by calculating a product automaton: 

\[ A_1 \cap A_2 = (Q_1 \times Q_2,\; Q_{f_1} \times Q_{f_2},\; \Gamma_1 \times \Gamma_2) \]

Complementation of a complete DFTA simply consists in flipping accepting 
and rejecting states. Note that a DFTA $(Q, Q_f, \Gamma)$ is complete if and only 
if, for each binary symbol $\sigma \in \Sigma$ and each pair $(q, q') \in Q^2$, there is a 
transition $q'' \leftarrow \sigma(q, q')$ for some $q'' \in Q$. 
Completing an automaton (e.g., by adding the missing states and transitions, 
and then possibly updating the set of final states [Comon et al., 1997]) may be 
required before complementing it. The complement of an FTA $A$ is denoted $\complement(A)$. 
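
The union and product constructions above can be sketched in Python as follows. This is an illustrative sketch only: automata are encoded as triples `(leaf_rules, node_rules, accepting)`, and the two example automata (`A_EVEN`, accepting trees with an even number of `a` nodes, and `A_LEFT`, accepting trees whose root has a leaf as left child) are hypothetical.

```python
from itertools import product

def run(tree, leaf, node):
    """States reachable at the root of `tree` (a leaf label or (label, l, r))."""
    if isinstance(tree, str):
        return set(leaf.get(tree, ()))
    label, left, right = tree
    out = set()
    for q1, q2 in product(run(left, leaf, node), run(right, leaf, node)):
        out |= set(node.get((label, q1, q2), ()))
    return out

def accepts(tree, automaton):
    leaf, node, accepting = automaton
    return bool(run(tree, leaf, node) & accepting)

def union(a1, a2):
    """Union automaton; assumes the two state sets are disjoint."""
    (l1, n1, f1), (l2, n2, f2) = a1, a2
    leaf = {s: set(l1.get(s, ())) | set(l2.get(s, ())) for s in l1.keys() | l2.keys()}
    node = {k: set(n1.get(k, ())) | set(n2.get(k, ())) for k in n1.keys() | n2.keys()}
    return leaf, node, f1 | f2

def intersection(a1, a2):
    """Product automaton: states are pairs, transitions move componentwise."""
    (l1, n1, f1), (l2, n2, f2) = a1, a2
    leaf = {s: set(product(l1[s], l2[s])) for s in l1.keys() & l2.keys()}
    node = {}
    for (s, p1, p2), ps in n1.items():
        for (s2, q1, q2), qs in n2.items():
            if s == s2:
                node[(s, (p1, q1), (p2, q2))] = set(product(ps, qs))
    return leaf, node, set(product(f1, f2))

# Hypothetical examples over the leaf symbol "e" and the binary symbol "a".
PARITY = {"ev": 0, "od": 1}
A_EVEN = ({"e": {"ev"}},
          {("a", p, q): {"od" if (PARITY[p] + PARITY[q] + 1) % 2 else "ev"}
           for p in PARITY for q in PARITY},
          {"ev"})
KINDS = ("L", "N", "ok")
A_LEFT = ({"e": {"L"}},
          {("a", p, q): {"ok" if p == "L" else "N"} for p in KINDS for q in KINDS},
          {"ok"})
```

The product has $|Q_1| \cdot |Q_2|$ states, which is the quadratic growth associated with intersection.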

Containment for FTA By taking advantage of these closure properties, 
it is possible to check the containment of two FTA $A_1$ and $A_2$ (determining 
whether the set of trees accepted by $A_1$ is included in the set of trees accepted 
by $A_2$) as the emptiness check of the FTA $A_1 \cap \complement(A_2)$. 

It can be decided in linear time whether the language accepted by an FTA 
is empty (see [Comon et al., 1997] for details). However, complementation 
requires determinization of the tree automaton, which may cause an exponential 
increase in the number of states in the worst case [Comon et al., 1997]. Thus 
this technique has exponential time complexity. Essentially, there is no 
better way of checking containment between two FTA. As a result, the FTA 
containment problem is in EXPTIME(*) [Seidl, 1990]. 
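
The emptiness check used in this reduction amounts to computing the set of states that can be reached from the leaves. The following sketch uses a naive fixpoint iteration rather than the linear-time algorithm referenced above; the dictionary encoding and the names `reachable_states` and `is_empty` are assumptions made for the example.

```python
def reachable_states(leaf_rules, node_rules):
    """Least fixpoint of the states reachable bottom-up from the leaves."""
    reached = set()
    for states in leaf_rules.values():
        reached |= set(states)
    changed = True
    while changed:
        changed = False
        for (_, q1, q2), targets in node_rules.items():
            if q1 in reached and q2 in reached and not set(targets) <= reached:
                reached |= set(targets)
                changed = True
    return reached

def is_empty(leaf_rules, node_rules, accepting):
    """The accepted language is empty iff no accepting state is reachable."""
    return not (reachable_states(leaf_rules, node_rules) & set(accepting))

# Hypothetical example: q1 is reachable, q3 is not (q2 is never produced).
LEAF = {"e": {"q0"}}
NODE = {("a", "q0", "q0"): {"q1"}, ("a", "q2", "q2"): {"q3"}}
```

Containment of $A_1$ in $A_2$ would then be tested by applying `is_empty` to a product of $A_1$ with the complement of a determinized and completed $A_2$.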

2.2 Queries 

Most queries used in the context of XML are either boolean or unary. Boolean 
queries give a yes/no answer on a tree (for instance, the validation of an XML 
document with respect to a DTD is a boolean query). Unary queries select nodes 
from a document (for instance, finding the set of nodes selected by an XPath 
expression is a unary query). 

Unary queries considered in this dissertation are among those defined by 
the powerful XPath standard introduced in Section 1.1.2. The static analysis 
of XPath queries is a hard problem that has recently attracted a lot of 
theoretical research attention. In particular, the computational complexity 
of the containment problem for XPath expressions has received much attention 
from the database community [Deutsch and Tannen, 2001, Wood, 2003, 
Neven and Schwentick, 2003, Schwentick, 2004, Miklau and Suciu, 2004]. The 
complexity of the emptiness problem for XPath expressions has also been 
studied in [Benedikt et al., 2005]. One source of difficulty for such decision 
problems is that they must be checked over a possibly infinite set of trees. 
A variety of factors also contribute to their complexity, such as the operators 
allowed in XPath queries and their combination. For instance, one 
difficulty arises from the combination of upward and downward navigation on 
trees with recursion [Vardi, 1998]. In fact, when the whole XPath language 
is considered, decision problems such as containment and emptiness are 
undecidable. Therefore, the literature has focused on identifying major 
XPath features and studying their impact on the complexity of XPath decision 
problems. The distinctions between major features studied in the literature 
(extended from [Benedikt et al., 2005]) follow: 

• positive vs. non-positive: depending on whether the negation operator is 
considered (non-positive) or not (positive) inside qualifiers. 

• downward vs. upward: depending on whether queries specify downward or 
upward traversal of the tree, or both. 

• recursive vs. non-recursive: depending on whether XPath transitive closure 
axes (for instance "descendant" or "ancestor") are considered or not. 

• qualified vs. non-qualified: depending on whether queries allow filtering 
qualifiers or not. 

• with vs. without data values: depending on whether comparisons of data 
values expressing joins are allowed or not. 

• with vs. without counting: depending on whether counting of tree nodes is 
allowed or not. 

(*) The complexity class EXPTIME is the set of all decision problems solvable by a 
deterministic Turing machine in $O(2^{p(n)})$ time, where $p(n)$ is a polynomial 
function of the input size $n$. 

Several XPath fragments combining only a few of these features have been 
studied: see [Schwentick, 2004] for an overview. From these results, it is known 
that the complexity of containment and satisfiability for (reasonably) restricted 
XPath fragments, even without type constraints, ranges from EXPTIME to 
undecidable. However, the techniques used for obtaining computational 
complexity bounds over specific subfragments do not scale when additional 
features are considered, and thus give no hints on how to address more realistic 
fragments. At the time of this dissertation, no algorithm is known that can 
effectively answer realistic XPath decision problems within acceptable time and 
space bounds. XPath decision problems have been partially characterized from a 
strict computational complexity point of view, and remain unsolved in practice. 



2.2.1 Syntax of XPath Expressions 

In this dissertation, particular attention is paid to supporting a large XPath 
fragment, as realistic as possible, covering major features of the XPath 
standard [Clark and DeRose, 1999]. The syntax of the considered XPath expressions 
is given in Figure 2.8. The considered XPath fragment is non-positive, both 
downward and upward, recursive, qualified, and also includes union and 
intersection. It includes all axes. This is the largest fragment considered so far in 
the literature. It covers all major XPath features except counting and data 
values. The integration of counting is kept for future work, based on related 
work on logics for counting [Dal-Zilio et al., 2004]. Data values are known to 
cause undecidability of XPath containment when combined with the previous 
factors [Benedikt et al., 2005, Schwentick, 2004](**). 

2.2.2 XPath Denotational Semantics 

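In the classical denotational semantics of paths, first given in [Wadler, 2000], 
the evaluation of an XPath expression over an XML document $t$ returns a 
set of nodes reachable from a context node $x$. The denotational semantics 
of the considered XPath fragment (adapted from [Wadler, 2000]) is given by 
the formal semantics function $\mathcal{S}_e$ which defines the set of nodes returned by 
expressions, starting from a context node $x$ in the tree: 

\[
\begin{array}{rcl}
\mathcal{S}_e[\![\cdot]\!]\cdot &:& \mathcal{L}_{XPath} \to Node \to Set(Node) \\
\mathcal{S}_e[\![/p]\!]x &=& \mathcal{S}_p[\![p]\!]root() \\
\mathcal{S}_e[\![p]\!]x &=& \mathcal{S}_p[\![p]\!]x \\
\mathcal{S}_e[\![e_1 \mid e_2]\!]x &=& \mathcal{S}_e[\![e_1]\!]x \cup \mathcal{S}_e[\![e_2]\!]x \\
\mathcal{S}_e[\![e_1 \cap e_2]\!]x &=& \mathcal{S}_e[\![e_1]\!]x \cap \mathcal{S}_e[\![e_2]\!]x
\end{array}
\]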



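(**) Note however that the very recent work found in [Bojanczyk et al., 2006] obtained the 
theoretical decidability (between NEXPTIME and 3-NEXPTIME) for a limited form of data 
value comparison. Integration of such restricted comparisons in the considered fragment and 
the effective algorithm presented in Chapter 6 is one of the perspectives of this dissertation. 
At least an additional exponential time blow-up is however expected. 

\[
\begin{array}{lrll}
\mathcal{L}_{XPath} \ni e & ::= & & \text{XPath expression} \\
 & & /p & \text{absolute path} \\
 & \mid & p & \text{relative path} \\
 & \mid & e_1 \mid e_2 & \text{union} \\
 & \mid & e_1 \cap e_2 & \text{intersection} \\
Path \quad p & ::= & & \text{path} \\
 & & p_1/p_2 & \text{path composition} \\
 & \mid & p[q] & \text{qualified path} \\
 & \mid & a::\sigma & \text{step with node test} \\
 & \mid & a::* & \text{step} \\
Qualif \quad q & ::= & & \text{qualifier} \\
 & & q_1 \text{ and } q_2 & \text{conjunction} \\
 & \mid & q_1 \text{ or } q_2 & \text{disjunction} \\
 & \mid & \text{not } q & \text{negation} \\
 & \mid & p & \text{path} \\
Axis \quad a & ::= & & \text{tree navigation axis (see Figure 1.2)} \\
 & & \text{child} \mid \text{self} \mid \text{parent} & \\
 & \mid & \text{descendant} \mid \text{descendant-or-self} & \\
 & \mid & \text{ancestor} \mid \text{ancestor-or-self} & \\
 & \mid & \text{following-sibling} \mid \text{preceding-sibling} & \\
 & \mid & \text{following} \mid \text{preceding} &
\end{array}
\]

Figure 2.8: XPath Abstract Syntax. 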



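The formal semantics function $\mathcal{S}_p$ defines the set of nodes returned by paths: 

\[
\begin{array}{rcl}
\mathcal{S}_p[\![\cdot]\!]\cdot &:& Path \to Node \to Set(Node) \\
\mathcal{S}_p[\![p_1/p_2]\!]x &=& \{x_2 \mid x_1 \in \mathcal{S}_p[\![p_1]\!]x \wedge x_2 \in \mathcal{S}_p[\![p_2]\!]x_1\} \\
\mathcal{S}_p[\![p[q]]\!]x &=& \{x_1 \mid x_1 \in \mathcal{S}_p[\![p]\!]x \wedge \mathcal{S}_q[\![q]\!]x_1\} \\
\mathcal{S}_p[\![a::\sigma]\!]x &=& \{x_1 \mid x_1 \in \mathcal{S}_a[\![a]\!]x \wedge name(x_1) = \sigma\} \\
\mathcal{S}_p[\![a::*]\!]x &=& \{x_1 \mid x_1 \in \mathcal{S}_a[\![a]\!]x\}
\end{array}
\]

Note that the semantics of the $p_1/p_2$ construct corresponds to composition 
of unary queries. In this sense, XPath is fundamentally different from regular 
expression patterns à la Hosoya [Hosoya and Pierce, 2001] that rather use 
pattern-matching techniques. The function $\mathcal{S}_q$ defines the semantics of 
qualifiers, which basically state the existence or absence of one or more paths 
from a context node: 

\[
\begin{array}{rcl}
\mathcal{S}_q[\![\cdot]\!]\cdot &:& Qualifier \to Node \to Boolean \\
\mathcal{S}_q[\![q_1 \text{ and } q_2]\!]x &=& \mathcal{S}_q[\![q_1]\!]x \wedge \mathcal{S}_q[\![q_2]\!]x \\
\mathcal{S}_q[\![q_1 \text{ or } q_2]\!]x &=& \mathcal{S}_q[\![q_1]\!]x \vee \mathcal{S}_q[\![q_2]\!]x \\
\mathcal{S}_q[\![\text{not } q]\!]x &=& \neg\, \mathcal{S}_q[\![q]\!]x \\
\mathcal{S}_q[\![p]\!]x &=& \mathcal{S}_p[\![p]\!]x \neq \emptyset
\end{array}
\]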

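The semantics of paths relies on the navigational semantics of axes, given by 
the function $\mathcal{S}_a$: 

\[
\begin{array}{rcl}
\mathcal{S}_a[\![\cdot]\!]\cdot &:& Axis \to Node \to Set(Node) \\
\mathcal{S}_a[\![\text{child}]\!]x &=& children(x) \\
\mathcal{S}_a[\![\text{parent}]\!]x &=& parent(x) \\
\mathcal{S}_a[\![\text{descendant}]\!]x &=& children^+(x) \\
\mathcal{S}_a[\![\text{ancestor}]\!]x &=& parent^+(x) \\
\mathcal{S}_a[\![\text{self}]\!]x &=& \{x\} \\
\mathcal{S}_a[\![\text{descendant-or-self}]\!]x &=& \mathcal{S}_a[\![\text{descendant}]\!]x \cup \mathcal{S}_a[\![\text{self}]\!]x \\
\mathcal{S}_a[\![\text{ancestor-or-self}]\!]x &=& \mathcal{S}_a[\![\text{ancestor}]\!]x \cup \mathcal{S}_a[\![\text{self}]\!]x \\
\mathcal{S}_a[\![\text{preceding}]\!]x &=& \{y \mid y \ll x\} \setminus \mathcal{S}_a[\![\text{ancestor}]\!]x \\
\mathcal{S}_a[\![\text{following}]\!]x &=& \{y \mid x \ll y\} \setminus \mathcal{S}_a[\![\text{descendant}]\!]x \\
\mathcal{S}_a[\![\text{following-sibling}]\!]x &=& \{y \mid y \in children(parent(x)) \wedge x \ll y\} \\
\mathcal{S}_a[\![\text{preceding-sibling}]\!]x &=& \{y \mid y \in children(parent(x)) \wedge y \ll x\}
\end{array}
\]

Path and axis navigation (illustrated on a sample tree by Figure 1.2) relies on a 
few assumed primitives over the XML tree data model: $root()$, which returns the 
root of the tree; $children(x)$, which returns the set of nodes that are children of 
the node $x$; $parent(x)$, which returns the parent node of the node $x$; the relation 
$\ll$, which defines the ordering: $x \ll y$ holds if and only if the node $x$ is before 
the node $y$ in the depth-first traversal order of the n-ary XML tree; and finally 
$name()$, which returns the labeling of a node. 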
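
To make the axis semantics concrete, the following Python sketch evaluates each axis directly from its definition over a small in-memory tree. It is an illustration under stated assumptions, not the evaluation strategy of any XPath engine: the `Node` class, the helper names (`document_order`, `closure`, `axis`, `step`) and the example tree are hypothetical.

```python
class Node:
    def __init__(self, name, *children):
        self.name, self.children, self.parent = name, list(children), None
        for child in self.children:
            child.parent = self

def document_order(root):
    """Nodes in depth-first traversal order (the ordering used for x << y)."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(reversed(node.children))
    return order

def closure(one_step, x):
    """Transitive closure one_step+(x) of a one-step navigation function."""
    result, todo = set(), list(one_step(x))
    while todo:
        node = todo.pop()
        if node not in result:
            result.add(node)
            todo.extend(one_step(node))
    return result

def axis(a, x, root):
    """Set of nodes S_a[[a]]x reached from the context node x through axis a."""
    order = document_order(root)
    rank = {node: i for i, node in enumerate(order)}
    children = lambda n: n.children
    parent = lambda n: [n.parent] if n.parent is not None else []
    siblings = x.parent.children if x.parent is not None else []
    table = {
        "child": lambda: set(children(x)),
        "parent": lambda: set(parent(x)),
        "self": lambda: {x},
        "descendant": lambda: closure(children, x),
        "ancestor": lambda: closure(parent, x),
        "descendant-or-self": lambda: closure(children, x) | {x},
        "ancestor-or-self": lambda: closure(parent, x) | {x},
        "following": lambda: {y for y in order if rank[x] < rank[y]} - closure(children, x),
        "preceding": lambda: {y for y in order if rank[y] < rank[x]} - closure(parent, x),
        "following-sibling": lambda: {y for y in siblings if rank[x] < rank[y]},
        "preceding-sibling": lambda: {y for y in siblings if rank[y] < rank[x]},
    }
    return table[a]()

def step(a, test, x, root):
    """S_p[[a::test]]x, where test is a node label or '*'."""
    return {y for y in axis(a, x, root) if test == "*" or y.name == test}

# Hypothetical example tree: a(b(d), c)
D = Node("d")
B = Node("b", D)
C = Node("c")
A = Node("a", B, C)
```

On this tree, the `following` axis from `d` selects only `c`, and the `preceding` axis from `c` selects `b` and `d` but not the ancestor `a`.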

2.3 Logical Formalisms: Two Yardsticks 

Unranked trees defined in Section 2.1.1 can be viewed as logical structures, 
in the sense of mathematical logic [Ebbinghaus and Flum, 2005]. In this 
vision, the domain of a tree $t$, viewed as a structure, is the set of nodes of $t$, 
denoted by Dom($t$). Formally, Dom($t$) is the subset of $\mathbb{N}^*$ defined as follows: 
if $t = \sigma(t_1, \ldots, t_n)$ with $\sigma \in \Sigma$, $n \geq 0$ and $t_1, \ldots, t_n$ trees, then 
Dom($t$) $= \{\epsilon\} \cup \{iu \mid i \in \{1, \ldots, n\}, u \in$ Dom($t_i$)$\}$. Thus, $\epsilon$ represents the 
root while $vj$ represents the $j$-th successor of $v$. 

A relational vocabulary $(\prec_{ch}, \prec_{sb}, \{O_\sigma \mid \sigma \in \Sigma\})$ is often used [Neven, 2002a, 
Barcelo and Libkin, 2005, Bojanczyk et al., 2006]. In this vocabulary, the $O_\sigma$ 
are unary relation predicates. For each label $\sigma$ in the alphabet $\Sigma$, $O_\sigma$ is the 
set of nodes that are labeled with $\sigma$. The symbols $\prec_{ch}$ and $\prec_{sb}$ are binary 
predicates. The symbol $\prec_{ch}$ is interpreted as the child relation: the set of pairs 
$(v, v \cdot i)$ where $v, v \cdot i \in$ Dom($t$). The symbol $\prec_{sb}$ is the sibling order: the set 
of pairs $(v \cdot i, v \cdot (i+1))$ where $v \cdot i, v \cdot (i+1) \in$ Dom($t$). 

Classically, $\prec^*_{ch}$ is defined as the transitive-reflexive closure of $\prec_{ch}$ (the 
descendant/ancestor relationship between two nodes), and $\prec^*_{sb}$ as the 
transitive-reflexive closure of $\prec_{sb}$ (the linear ordering on siblings). 
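
The domain of an unranked tree and the two binary relations of this vocabulary can be computed directly from the definitions. The sketch below is only illustrative: trees are encoded as pairs `(label, [subtrees])`, positions as tuples of 1-based child indices (the empty tuple standing for the root), and the function names are assumptions made for the example.

```python
def dom(tree, pos=()):
    """Dom(t): the set of positions of an unranked tree (label, [subtrees])."""
    _, children = tree
    nodes = {pos}
    for i, child in enumerate(children, start=1):
        nodes |= dom(child, pos + (i,))
    return nodes

def child_relation(d):
    """Pairs (v, v.i) with both positions in the domain."""
    return {(v[:-1], v) for v in d if v}

def sibling_relation(d):
    """Pairs (v.i, v.(i+1)) with both positions in the domain."""
    return {(v, v[:-1] + (v[-1] + 1,)) for v in d
            if v and v[:-1] + (v[-1] + 1,) in d}

def labeling(tree, pos=()):
    """Unary predicates: maps each label to the set of positions carrying it."""
    label, children = tree
    result = {label: {pos}}
    for i, child in enumerate(children, start=1):
        for lab, ps in labeling(child, pos + (i,)).items():
            result.setdefault(lab, set()).update(ps)
    return result

# Hypothetical example tree: a(b, c(d))
T = ("a", [("b", []), ("c", [("d", [])])])
```

For the example tree, the domain contains the root, its two children, and the single child of the second child.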

Most formalisms used in the context of XML are related to one of the 
two logics used over these relational structures: first-order logic and monadic 
second-order logic: 

• first-order logic and its relatives are frequently used for query languages, 
since they nicely capture the navigational features presented in 
Section 2.2.2. 

• monadic second-order logic, which extends first-order logic with 
quantification over sets of nodes, is one of the most expressive (yet decidable) 
logics known. One of its main advantages in the context of XML is its 
ability to fully support XML types (regular tree languages). 

The next sections are dedicated to these two logical formalisms, which are used 
as yardstick logics in the XML setting. First-order logic is denoted by FO, 
and monadic second-order logic by MSO. For XML applications, the relational 
vocabulary contains at least the labeling predicates $O_\sigma$ for $\sigma \in \Sigma$, which are 
thus omitted from notations in the remainder. The rest of the vocabulary is 
listed between brackets. For example, MSO[$\prec_{ch}$, $\prec_{sb}$] refers to the vocabulary 
$(\prec_{ch}, \prec_{sb}, \{O_\sigma \mid \sigma \in \Sigma\})$. An important distinction between MSO and FO is 
that $\prec^*_{ch}$ and $\prec^*_{sb}$ are definable from $\prec_{ch}$ and $\prec_{sb}$ in MSO (using second-order 
quantification) but not in FO. 

2.4 First Order Logic 

Over a general relational structure, FO is undecidable, while its two-variable 
fragment is decidable [Mortimer, 1975]. Therefore, restricting FO to its 
two-variable fragment, denoted FO$^2$, has become a classical idea when looking for 
decidability [Gradel and Otto, 1999]. Furthermore, since $\prec^*_{ch}$ and $\prec^*_{sb}$ are not 
definable from $\prec_{ch}$ and $\prec_{sb}$ in FO, FO$^2$[$\prec^*_{ch}$, $\prec^*_{sb}$] is generally considered. 

From the work found in [Geneves and Vion-Dury, 2004] and [Marx, 2004a], 
it is known that the expressive power of XPath is close to FO$^2$[$\prec^*_{ch}$, $\prec^*_{sb}$], which 
captures its navigational behavior. Specifically, in [Geneves and Vion-Dury, 2004], 
an FO$^2$[$\prec^*_{ch}$, $\prec^*_{sb}$] interpretation of an XPath fragment is given and proven 
correct with respect to the XPath denotational semantics presented in Section 2.2.2. 
The work found in [Marx, 2004a] characterizes the navigational fragment of XPath 
(introduced as "Core XPath" in [Gottlob et al., 2005]) and shows how it can 
be extended in order to be complete with respect to FO$^2$[$\prec^*_{ch}$, $\prec^*_{sb}$]. 

The very recent work found in [Bojanczyk et al., 2006] proves the 
decidability of FO$^2$[$\prec_{ch}$, $\prec_{sb}$, $\sim$] where $\sim$ is a binary predicate such that $x \sim y$ holds 
for two nodes if they have the same data value. A consequence is the 
theoretical decidability of a limited form of comparison of data values in XPath. 
The complexity of the corresponding decision procedure is shown to lie between 
NEXPTIME and 3-NEXPTIME, but unfortunately the approach gives no clue for a 
relevant effective algorithm [Bojanczyk et al., 2006]. 

FO nevertheless remains a convenient formalism for obtaining decidability 
results or theoretical characterizations of XPath queries. However, an 
argument in favor of MSO is that FO and its variants do not fully capture regular 
tree types [Benedikt and Segoufin, 2005], which makes them unsuited for dealing 
with XML types. 



2.5 Monadic Second-Order Logic 

MSO over trees is one of the most expressive, yet decidable, logics known. 
It has been known since the 1960's that MSO exactly captures regular tree types. 
The appropriate MSO[$\prec_{ch}$, $\prec_{sb}$] variant over finite binary trees is named WS2S, 
which stands for weak monadic second-order logic of two successors. WS2S was 
introduced in [Thatcher and Wright, 1968, Doner, 1970]. In this calculus, 
first-order variables range over tree nodes. Second-order variables are interpreted 
as finite sets of tree nodes. Weak means that the set variables are allowed 
to range only over finite sets. This is enough since XML documents have an 
unbounded depth but remain finite trees. Monadic means that quantification 
is only allowed over unary relations (sets), not over polyadic relations. The 
two successors refer to the left and right successors of a node in the binary 
tree. They are sufficient to consider general unranked XML trees without loss 
of generality, owing to the mapping $\beta(\cdot)$ presented in Section 2.5.1. 

This section progressively introduces WS2S in detail, and explains how it is 
decided through the automaton-logic connection [Thatcher and Wright, 1968, 
Doner, 1970] using tree automata introduced in Section 2.1.4. 



2.5.1 Preliminary Definitions 

For notation consistency purposes, by convention, 0 is used for denoting the left 
successor and 1 for denoting the right successor of a node in a binary tree. The 
definition of the domain of a finite binary tree is thus slightly updated as follows. 
For a finite binary tree $t$, Dom($t$) is defined as the subset of $\{0,1\}^*$ such that 
if $t = \sigma(t_0, t_1)$ with $\sigma \in \Sigma$ and $t_0, t_1$ binary trees, then 
Dom($t$) $= \{\epsilon\} \cup \{iu \mid i \in \{0,1\}, u \in$ Dom($t_i$)$\}$. 
Here $\epsilon$ represents the root while $vj$ represents the $(j+1)$-th successor of $v$, for 
$j \in \{0,1\}$. A node in the binary tree is thus a finite string over the alphabet $\{0,1\}$. 

The notion of characteristic sets is now defined, which further formalizes 
and generalizes the unary predicates introduced in Section 2.3 for the 
labeling. A characteristic function of a set $B$ is a function from $A$ to $\{0,1\}$, where 
$A$ is a superset of $B$. It returns 1 if and only if the element of $A$ is also an 
element of $B$: 

\[
B \subseteq A, \qquad f : A \to \{0,1\}, \qquad
\forall a \in A,\; f(a) = \begin{cases} 1 & \text{if } a \in B \\ 0 & \text{if } a \notin B \end{cases}
\]

A characteristic set is a subset of a set $A$ that contains all elements of $A$ for 
which the characteristic function returns 1: 

\[ X_f \subseteq A, \qquad X_f = \{a \in A \mid f(a) = 1\} \]

In the following, characteristic sets of interest are subsets of Dom($t$), which 
denote where a particular property holds in a tree. Particular attention is paid 
to the characteristic sets $X_{f_a}$ which denote where a particular symbol $a$ occurs. 
Consider for instance the binary tree $t = a(b(\epsilon, c(\epsilon, d)), \epsilon)$ over the alphabet 
$\Sigma = \{a, b, c, d\}$. It is identified by its tuple representation 
$t = (X_{f_a}, X_{f_b}, X_{f_c}, X_{f_d})$ where $X_{f_a}$ is the characteristic set of the symbol $a$: 

\[
X_{f_a} = \{\epsilon\}, \qquad X_{f_b} = \{0\}, \qquad X_{f_c} = \{01\}, \qquad X_{f_d} = \{011\}
\]

The set $X_{f_a} \cup X_{f_b} \cup X_{f_c} \cup X_{f_d}$ of all positions contained in characteristic sets 
forms a shape. 
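
The characteristic sets of the example can be recomputed mechanically. In the following sketch, which is only an illustration, a binary tree is encoded as `None` for the empty tree $\epsilon$ or as a triple `(label, left, right)`, positions are strings over `0` and `1` (the empty string standing for the root), and the function name `characteristic_sets` is an assumption made for the example.

```python
def characteristic_sets(tree, pos=""):
    """Map each symbol to the set of positions where it occurs."""
    sets = {}
    if tree is None:          # the empty tree has no position
        return sets
    label, left, right = tree
    sets.setdefault(label, set()).add(pos)
    for bit, subtree in (("0", left), ("1", right)):
        for lab, positions in characteristic_sets(subtree, pos + bit).items():
            sets.setdefault(lab, set()).update(positions)
    return sets

# The tree t = a(b(eps, c(eps, d)), eps) of the running example.
T = ("a", ("b", None, ("c", None, ("d", None, None))), None)
SETS = characteristic_sets(T)

# The shape is the union of all characteristic sets.
SHAPE = set().union(*SETS.values())
```

Here `SETS` maps `a` to the root position, `b` to `0`, `c` to `01` and `d` to `011`, in agreement with the tuple representation given above.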

A node belongs to a characteristic set $X_{f_a}$ (also written $X_a$) if and only if 
the node is labeled by $a$. Note that in the example of Figure 2.1, one and 
only one symbol occurs at each position. In the general case however, there is 
no restriction on the content of characteristic sets. A given node may belong 
to several characteristic sets. In this case, a node may be labeled by several 
symbols. This can be used to encode properties other than XML labeling. 
Conversely, a particular position may not be a member of any characteristic 
set. In this case, the overall structure contains a node which is not labeled 
by any symbol of the considered alphabet; therefore it is no longer a labeled 
tree on this alphabet. Chapter 3 examines how XML trees can be encoded 
by constraining these structures using WS2S formulas introduced in the next 
section. 



2.5.2 WS2S Formulas 

From a syntactic point of view, WS2S formulas can be generated by a simple 
core language, whose abstract syntax follows: 

\[
\begin{array}{lrll}
\mathcal{L}_{WS2S} \ni \Phi, \Psi & ::= & & \text{formula} \\
 & & X \subseteq Y & \text{inclusion} \\
 & \mid & X = Y - Z & \text{difference} \\
 & \mid & X = Y.0 & \text{first successor} \\
 & \mid & X = Y.1 & \text{second successor} \\
 & \mid & \neg\Phi & \text{negation} \\
 & \mid & \Phi \wedge \Psi & \text{conjunction} \\
 & \mid & \exists X.\Phi & \text{existential quantification}
\end{array}
\]

where $X$, $Y$, and $Z$ denote arbitrary second-order variables. Other usual 
logical connectives can be derived as syntactic sugar of the core: 

\[
\begin{array}{rcl}
\Phi \vee \Psi &\stackrel{\text{def}}{=}& \neg(\neg\Phi \wedge \neg\Psi) \\
\Phi \Rightarrow \Psi &\stackrel{\text{def}}{=}& \neg\Phi \vee \Psi \\
\Phi \Leftrightarrow \Psi &\stackrel{\text{def}}{=}& \Phi \wedge \Psi \vee \neg\Phi \wedge \neg\Psi
\end{array}
\]

Note that only second-order variables appear in the core. This is because 
first-order variables can be encoded as singleton second-order variables. A notation 
convention is adopted for simplifying the remaining part of the chapter: 
first-order variables are written in lowercase and second-order variables in uppercase. 

2.5.3 WS2S Semantics 

This section gives an interpretation of WS2S formulas as finite subsets of 
$\{0,1\}^*$. Given a fixed main formula $\varphi$ with $k$ variables, its semantics is 
defined inductively. Let a tuple representation $t = (X_1, \ldots, X_k) \in (\{0,1\}^*)^k$ be 
an interpretation of $\varphi$. The notation $t(X)$ denotes the interpretation $X_i$ (such 
that $1 \leq i \leq k$) that $t$ associates to the variable $X$ occurring in $\varphi$. The 
semantics of $\varphi$ is inductively defined relative to $t$. The notation $t \models \varphi$ (which is 
read: $t$ satisfies $\varphi$) is used if the interpretation $t$ makes $\varphi$ true: 

\[
\begin{array}{lcl}
t \models X \subseteq Y & \text{iff} & t(X) \subseteq t(Y) \\
t \models X = Y - Z & \text{iff} & t(X) = t(Y) \setminus t(Z) \\
t \models X = Y.0 & \text{iff} & t(X) = \{p.0 \mid p \in t(Y)\} \\
t \models X = Y.1 & \text{iff} & t(X) = \{p.1 \mid p \in t(Y)\} \\
t \models \neg\varphi & \text{iff} & t \not\models \varphi \\
t \models \varphi_1 \wedge \varphi_2 & \text{iff} & t \models \varphi_1 \text{ and } t \models \varphi_2 \\
t \models \exists X.\varphi & \text{iff} & \exists I \subseteq \{0,1\}^*,\; t[X \mapsto I] \models \varphi
\end{array}
\]

where the notation $t[X \mapsto I]$ denotes the tuple representation that interprets 
$X$ as $I$ and all other variables as $t$ does. Note that the two successors of a 
particular position always exist in WS2S. 

A formula $\varphi$ naturally defines a language $\mathcal{L}(\varphi) = \{t \mid t \models \varphi\}$ over the 
alphabet $(\{0,1\}^*)^k$, where $k$ is the number of variables of $\varphi$. 
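
The quantifier-free part of this semantics can be checked directly against a given interpretation, as in the following sketch. The encoding is an assumption made for illustration: formulas are nested tuples with hypothetical tags (`"sub"`, `"diff"`, `"succ"`, `"not"`, `"and"`), and an interpretation is a dictionary from variable names to finite sets of positions written as strings over `0` and `1`. Existential quantification is deliberately left out, since it ranges over infinitely many candidate sets; handling it is precisely the role of the automata construction of Section 2.5.5.

```python
def holds(formula, t):
    """Decide t |= formula for quantifier-free WS2S formulas."""
    op = formula[0]
    if op == "sub":                      # X included in Y
        _, x, y = formula
        return t[x] <= t[y]
    if op == "diff":                     # X = Y - Z
        _, x, y, z = formula
        return t[x] == t[y] - t[z]
    if op == "succ":                     # X = Y.b with b in {"0", "1"}
        _, x, y, b = formula
        return t[x] == {p + b for p in t[y]}
    if op == "not":
        return not holds(formula[1], t)
    if op == "and":
        return holds(formula[1], t) and holds(formula[2], t)
    raise ValueError("unsupported connective: %r" % (op,))

# Hypothetical interpretation: Z holds the root, Y its left successor,
# X its right successor.
T = {"X": {"1"}, "Y": {"0"}, "Z": {""}}
BODY = ("and", ("succ", "Y", "Z", "0"), ("succ", "X", "Z", "1"))
```

With this interpretation, `holds(BODY, T)` is true: `T` is a witness for the variables bound in a formula such as $\exists X \exists Y.\; Y = Z.0 \wedge X = Z.1$.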

2.5.4 Equivalence of WS2S and FTA 

It has been known since the 1960's that the class of regular tree languages is 
linked to decidability questions in formal logics. In particular, WS2S is 
decidable through the automaton-logic connection [Thatcher and Wright, 1968, 
Doner, 1970], using tree automata (introduced in Section 2.1.4). In 1968, 
Thatcher and Wright proved the following equivalence: 

Theorem 2.5.1 ([Thatcher and Wright, 1968]) WS2S is as expressive as 
finite tree automata. 

The proof works in two directions. First, it is shown that a WS2S formula can 
be created such that it simulates a successful run of a tree automaton. Second, 
for any given WS2S formula a corresponding tree automaton can be built. 

Technically, the correspondence between WS2S formulas and tree automata relies 
on a convenient representation that links the truth status of a formula with 
the recognition performed by an automaton. This representation is a matrix 
view of the tuple representation described in Section 2.5.1. Let $t$ be a tuple; 
its matrix representation $\bar{t}$ is indexed by variable indices and positions in 
the tree. Entries of $\bar{t}$ correspond to values in $\{0,1\}$ of characteristic functions: 
an entry $(v, p) = 1$ in $\bar{t}$ means that the position $p$ belongs to the interpretation 
of the variable $X_v$. 

Consider for instance the formula $\varphi = (\exists X \exists Y.\; Y = Z.0 \wedge X = Z.1)$, which 
has three variables $X$, $Y$, and $Z$. A typical matrix for this formula has one row 
per variable ($X$, $Y$, $Z$) and one column per tree position (such as 00, 01, 010 
and 1). 

Note that this matrix is finite since only finite trees are considered. It can 
nevertheless capture finite trees of unbounded depth. In return, 
there is an infinite number of matrices that define the same interpretation: 
any number of columns of zeros may be appended at the right end of the 
matrix (for positions after the end of the tree). Let $\bar{t}$ be the minimum matrix, 
without such an empty suffix. Rows of the matrix are called tracks and give the 
interpretation of each variable, which is defined as the finite set $\{p \mid$ the bit for 
position $p$ in the $X_i$ track is 1$\}$. 

Each column of the matrix is a bit vector that indicates the membership 
status of a node with respect to the variables of the formula. The automaton 
recognizes all the interpretations (matrices) that satisfy the formula. A row-by-row 
reading of the matrix gives the interpretation of each variable (i.e., its associated 
set of positions), whereas an automaton processes the matrix column by column; it 
makes a transition on each bit-vector. 

2.5.5 From Formulas to Automata 

Given a particular formula, a corresponding FTA can be built in order to decide 
the truth status of the formula. 

Let $\varphi$ be a formula with $k$ second-order variables. As an interpretation 
of $\varphi$, consider a tuple representation $t = (X_1, \ldots, X_k) \in (\{0,1\}^*)^k$. The tree 
automaton that corresponds to $\varphi$ is denoted $A[\![\varphi]\!]$. $A[\![\varphi]\!]$ operates over the 
alphabet $\Sigma = \{0,1\}^k$, and can be seen as processing the matrix representation 
$\bar{t}$ of $t$ column by column. Note however that there is an infinite number of 
matrices that define the same interpretation. On the one hand, any number of 
columns of zeros can appear at the end of the matrix. On the other hand, a column 
of zeros can also appear for any position in the tree, before a non-empty column, 
denoting that this position is not a member of any interpretation. The automaton 
therefore faces a problem: when reading a column of zeros, it must know whether 
the recognition should stop (because the end of the tree has been reached) or 
continue. In other words, the automaton needs to know the maximal depth of the 
tree as additional information in order to know when to stop. To this end, a new 
termination symbol $\bot$ is introduced. From the matrix point of view, this symbol 
appears as a component of a bit-vector whenever this component will no longer be 1 
in the remaining bit-vectors to be processed. Technically, $A[\![\varphi]\!]$ recognizes the 
tree representation $\tilde{t}$ of $t$, where $\tilde{t}$ is obtained from $t$ as follows: 

1. the set of positions of $\tilde{t}$ is the prefix-closure of $X_1 \cup \ldots \cup X_k$; 

2. leaves of $\tilde{t}$ are labeled with $\bot^k$; 

3. binary constructors of the tree are labeled with an element of $\{\bot, 0, 1\}^k$ 
such that the $i$-th component of a position $p$ in $\tilde{t}$ is marked: 1 if and only 
if $p \in X_i$, 0 if and only if $p \notin X_i$ and some extension of $p$ is in $X_i$, and $\bot$ 
otherwise. 

Note that in this tree representation, $\bot$ appears as a component of a node 
label whenever no descendant node has a 1 for the same component. For 
example, Figure 2.9 gives the tuple, the matrix, and the tree representation of 
a particular satisfying interpretation of the formula $X \subseteq Y$. 

Tuple:   ({0}, {0,1}) 

Matrix:        ε   0   1 
          X    0   1   0 
          Y    0   1   1 

Tree:            00 
               /    \ 
             11      ⊥1 
            /  \    /  \ 
          ⊥⊥   ⊥⊥  ⊥⊥   ⊥⊥ 

Figure 2.9: Representations of a Satisfying Interpretation of X ⊆ Y. 
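
The three rules defining the tree representation can be sketched as a small labeling function. This is an illustration only: the function name `tree_labels`, the encoding of positions as strings over `0` and `1`, and the representation of a label as a string with one character per variable are assumptions made for the example.

```python
BOT = "⊥"

def tree_labels(t):
    """Labels of the tree representation of a tuple t = (X_1, ..., X_k).

    Returns a dictionary {position: label}; inner nodes are the positions of
    the prefix-closure of the union of the X_i, and their missing successors
    are added as leaves labeled with k copies of the termination symbol.
    """
    k = len(t)
    inner = {p[:i] for X in t for p in X for i in range(len(p) + 1)}
    labels = {}
    for p in inner:
        components = []
        for X in t:
            if p in X:
                components.append("1")
            elif any(q != p and q.startswith(p) for q in X):
                components.append("0")       # some extension of p is in X
            else:
                components.append(BOT)       # no 1 at or below p for X
        labels[p] = "".join(components)
    for p in inner:
        for bit in "01":
            labels.setdefault(p + bit, BOT * k)   # leaves
    return labels

# The interpretation of Figure 2.9: X = {0}, Y = {0, 1}.
LABELS = tree_labels(({"0"}, {"0", "1"}))
```

On the interpretation of Figure 2.9, the root is labeled `00`, its successors `11` and `⊥1`, and the four leaves `⊥⊥`.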



Theorem 2.5.2 ([Thatcher and Wright, 1968, Doner, 1970]) For every 
formula $\varphi$, there is an automaton $A[\![\varphi]\!]$ such that: 

\[ t \models \varphi \iff A[\![\varphi]\!] \text{ accepts } \tilde{t} \]

where $\tilde{t}$ is the tree representation of $t$. 

The automaton $A[\![\varphi]\!]$ is calculated using an induction scheme. A basic 
bottom-up tree automaton corresponds to each atomic formula $X \subseteq Y$, 
$X = Y - Z$, $X = Y.0$ and $X = Y.1$. 

[Display of the four atomic automata $A[\![X \subseteq Y]\!]$, $A[\![X = Y - Z]\!]$, 
$A[\![X = Y.0]\!]$ and $A[\![X = Y.1]\!]$; the transition tables are only partially 
legible. $A[\![X = Y - Z]\!]$ uses a single state $q$, with transitions of the form 
$q \leftarrow \ell(q, q)$ for labels $\ell$ including ⊥⊥0, ⊥0⊥, ⊥00, ⊥01, 0⊥⊥, 0⊥0, 0⊥1, 00⊥, 000, 
001, 011, 11⊥ and 110. $A[\![X = Y.1]\!]$ uses states $q$, $q'$, $q''$, with transitions 
including $q' \leftarrow 00(q', q)$, $q' \leftarrow 01(q, q'')$, $q'' \leftarrow 1\bot(q, q)$ and 
$q'' \leftarrow 10(q, q)$, and accepting state $q'$.] 

Logical connectives are then translated into automata-theoretic operations, 
taking advantage of the closure properties of tree automata (presented in 
Section 2.1.4). Formula conjunction is translated into intersection of automata: 

\[ A[\![\varphi_1 \wedge \varphi_2]\!] = A[\![\varphi_1]\!] \cap A[\![\varphi_2]\!] \]

and negation is translated into automata complementation: 

\[ A[\![\neg\varphi]\!] = \complement(A[\![\varphi]\!]) \]

Existential quantification relies on projection and determinization of tree 
automata. The automaton $A[\![\exists X.\varphi]\!]$ is derived from $A[\![\varphi]\!]$ by projection. 
This means that the symbols of the alphabet of $A[\![\exists X.\varphi]\!]$ have one component 
fewer than those of the alphabet of $A[\![\varphi]\!]$: in every tuple of $A[\![\varphi]\!]$ the $X$ 
component is removed, so that its size is decreased by one. The rest of the 
automaton remains the same. Intuitively, $A[\![\exists X.\varphi]\!]$ acts as $A[\![\varphi]\!]$ except that 
it is allowed to guess the bits for $X$. The automaton $A[\![\exists X.\varphi]\!]$ may be 
non-deterministic even if $A[\![\varphi]\!]$ was not [Comon et al., 1997], which is why 
determinization is required. 

As a result, for every formula $\varphi$ it is possible to build an automaton $A[\![\varphi]\!]$ 
in this manner, which defines the same language as $\varphi$: 

\[ \mathcal{L}(A[\![\varphi]\!]) = \mathcal{L}(\varphi) \]

Analyzing the automaton $A[\![\varphi]\!]$ makes it possible to decide the truth status of 
the formula $\varphi$: 

• if $\mathcal{L}(A[\![\varphi]\!]) = \emptyset$ then $\varphi$ is unsatisfiable; 

• else $\varphi$ is satisfiable. If $\mathcal{L}(\complement(A[\![\varphi]\!])) = \emptyset$ then $\varphi$ is always satisfied (valid). 

Possessing the full automaton corresponding to a formula is of great value, 
since it can be used for generating examples and counter-examples of the truth 
status of the formula. A relevant example (or counter-example) can be built 
by looking for an accepting run of the automaton (or its complement). 

2.5.6 WS2S Complexity 

Two factors have a major impact on the cost of a WS2S decision procedure: 

1. the number of second-order variables in the formula 

2. the number of states of the corresponding automaton (automaton size) 

The number of second-order variables determines the alphabet size. More 
precisely, a formula with $k$ variables is decided by an automaton operating on 
the alphabet $\Sigma = \{0,1\}^k$. Representing the transition function $\delta$ of such an 
automaton can be prohibitive. Indeed, in the worst case, the representation of 
a complete FTA requires $2^k \cdot |Q|^2$ transitions where $Q$ is the set of states of the 
automaton. A direct encoding with classical FTA such as the one described 
in Section 2.5.5 would lead to an impracticable algorithm. Modern logical 
solvers represent transition functions using BDDs [Bryant, 1986] that can lead 
to exponential improvements [Klarlund and Møller, 2001, Tanabe et al., 2005]. 

As seen in Section 2.5.5, automaton construction is performed inductively 
by composing automata corresponding to each sub-formula. During this pro- 
cess, the number of states of intermediate automata may grow significantly. 
Automaton size depends on the nature of the automata-theoretic operation 
applied and the sizes of automata constructed so far. Each operation on tree 
automata particularly affects the size of the resulting automaton: 

• Automata intersection causes a quadratic increase in automaton size in
the worst case, as do all binary WS2S connectives (∧, ∨, ⇒), which
involve automata products [Klarlund et al., 2001].

• When considering deterministic complete automata, automata complementation
corresponding to WS2S negation is a linear-time algorithm
that consists in flipping accepting and rejecting states.

• The major source of complexity originates from automata determiniza- 
tion which may cause an exponential increase of the number of states 
in the worst case [Comon et al., 1997]. Logical quantification involves
automaton projection (cf. Section 2.5.5), which may result in a non-deterministic
automaton, thus involving determinization. Fortunately, a
succession of quantifications of the same type can be combined as a single
projection followed by a single determinization. However, any alterna- 
tion of second-order quantifiers requires a determinization, thus possibly 
causing an exponential increase of the automaton size. 
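
The cost of the product and complement operations discussed above can be made concrete with a minimal sketch. It assumes deterministic, complete bottom-up automata over binary trees, with trees written as nested tuples; the class `TreeAutomaton`, the helper names and the example automata are illustrative assumptions, not the data structures of any WS2S solver. Determinization is deliberately left out.

```python
from itertools import product


class TreeAutomaton:
    """Deterministic, complete bottom-up automaton on binary trees.

    A tree is either None (the empty tree) or a tuple (label, left, right).
    """

    def __init__(self, states, alphabet, init, delta, accepting):
        self.states = set(states)
        self.alphabet = set(alphabet)
        self.init = init              # state assigned to the empty tree
        self.delta = dict(delta)      # (label, q_left, q_right) -> state
        self.accepting = set(accepting)

    def run(self, tree):
        if tree is None:
            return self.init
        label, left, right = tree
        return self.delta[(label, self.run(left), self.run(right))]

    def accepts(self, tree):
        return self.run(tree) in self.accepting


def intersect(a, b):
    """Product automaton for conjunction: |a.states| * |b.states| states."""
    states = set(product(a.states, b.states))
    delta = {
        (s, (p1, p2), (q1, q2)): (a.delta[(s, p1, q1)], b.delta[(s, p2, q2)])
        for s in a.alphabet
        for (p1, p2) in states
        for (q1, q2) in states
    }
    accepting = {(p, q) for (p, q) in states
                 if p in a.accepting and q in b.accepting}
    return TreeAutomaton(states, a.alphabet, (a.init, b.init), delta, accepting)


def complement(a):
    """Negation: flip accepting and rejecting states."""
    return TreeAutomaton(a.states, a.alphabet, a.init, a.delta,
                         a.states - a.accepting)


def is_empty(a):
    """True iff no accepting state is reachable, i.e. no tree is accepted."""
    reachable = {a.init}
    changed = True
    while changed:
        changed = False
        for (_, p, q), r in a.delta.items():
            if p in reachable and q in reachable and r not in reachable:
                reachable.add(r)
                changed = True
    return not (reachable & a.accepting)


def contains(label, alphabet=("a", "b")):
    """Hypothetical example: accepts the trees having a node named `label`."""
    delta = {(s, p, q): int(s == label or p == 1 or q == 1)
             for s in alphabet for p in (0, 1) for q in (0, 1)}
    return TreeAutomaton({0, 1}, alphabet, 0, delta, {1})


has_a, has_b = contains("a"), contains("b")
both = intersect(has_a, has_b)
```

In this sketch the product of two 2-state automata has 4 states and 2 · 4² transitions, which illustrates the quadratic growth, while `complement` only changes the accepting set. Testing `is_empty(intersect(has_a, complement(has_a)))` mirrors the unsatisfiability test of Section 2.5.5.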

As a consequence, the number of states of the final automaton corresponding
to a formula with n quantifier alternations is in the worst case a tower of
exponentials of height c · n where c is some constant, and this is a lower bound
[Stockmeyer and Meyer, 1973]. The translation from logical formulas to tree
automata is thus non-elementary*:

Theorem 2.5.3 [Meyer, 1975, Stockmeyer, 1974] The satisfiability problem
for WS2S formulas has an unbounded stack of exponentials as worst-case lower
bound.

This high complexity, originating from the full construction and complementation
of intermediate tree automata, is the counterpart of WS2S expressiveness
and succinctness. Chapter 3 of this dissertation investigates how it is possible
to deal with this complexity in practice, and proposes a decision procedure
for XPath containment based on WS2S along with optimizations of the WS2S
decision procedure in the XML setting.

* The term elementary introduced by Grzegorczyk [Grzegorczyk, 1953] refers to functions
obtained from some basic functions by operations of limited summation and limited
multiplication. Consider the function tower() defined by:

$$\mathrm{tower}(n, 0) = n$$
$$\mathrm{tower}(n, k + 1) = 2^{\mathrm{tower}(n, k)}$$

Grzegorczyk has shown that every elementary function in one argument is bounded by
$\lambda n.\mathrm{tower}(n, c)$ for some constant c. Hence, the term non-elementary refers to a function
that grows faster than any such function. 
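
To give a feeling for this growth, the tower function can be transcribed directly. This is only an illustrative sketch of the two defining equations; the name `tower` follows the footnote and nothing else is assumed.

```python
def tower(n, k):
    """tower(n, 0) = n and tower(n, k + 1) = 2 ** tower(n, k)."""
    result = n
    for _ in range(k):
        result = 2 ** result
    return result


# Each additional level of the tower is one more exponential.
levels = [tower(1, k) for k in range(5)]
```

Already `tower(1, 5)` is 2 to the power 65536, a number with almost twenty thousand decimal digits, which is why a height that grows with the number of quantifier alternations cannot be bounded by any fixed stack of exponentials.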
2.6 Temporal Logics

Some temporal and fixpoint logics closely related to FO and MSO have been
introduced and make it possible to avoid explicit automata construction.

2.6.1 FO Relatives 

For query languages, Computation Tree Logic (CTL) has been proposed in
[Clarke and Emerson, 1981]. CTL is equivalent to FO over tree structures
[Barcelo and Libkin, 2005] and its satisfiability is in EXPTIME. The connection
between XPath and FO relatives like CTL has been studied in [Marx, 2004b,
Miklau and Suciu, 2004, Barcelo and Libkin, 2005]. In particular, the work
found in [Marx, 2004b] characterizes a subset of XPath in terms of extensions of
CTL, whose satisfiability is in EXPTIME. The authors of [Miklau and Suciu, 2004]
also observed that a fragment of XPath can be embedded in CTL. However, regular
tree languages are not fully captured by FO [Benedikt and Segoufin, 2005].
These approaches are therefore not intended to support XML types.

In an attempt to reach more expressive power, the work presented
in [Afanasiev et al., 2005] proposes a variant of Propositional Dynamic Logic
(PDL) [Fischer and Ladner, 1979] with an EXPTIME complexity, but whose
exact expressive power (as a strict subset of MSO) is still under study.

The goal of the XPath research presented so far is limited to establishing 
new theoretical properties and complexity bounds. 

The research presented in this dissertation differs in that it seeks, in addition 
to the previous goals, efficient implementation techniques and concrete design 
that may be directly applied to XML type-checking problems involving XPath 
queries and regular tree types. 

2.6.2 MSO Relatives 

The propositional modal μ-calculus introduced in [Kozen, 1983] has been shown
to be as expressive as non-deterministic tree automata [Emerson and Jutla, 1991].
From [Arnold and Niwinski, 1992, Kupferman and Vardi, 1999], it is known
that WS2S is exactly as expressive as the alternation-free fragment (AFMC)
of the propositional modal μ-calculus. The μ-calculus subsumes all early logics
such as CTL and PDL (see [Barcelo and Libkin, 2005] for a recent survey
on tree logics). The μ-calculus is trivially closed under negation, can be
extended with converse programs, and still remains decidable in EXPTIME
[Vardi, 1998]. The best known complexity for the resulting logic is

$$2^{O(n^4 \cdot \log n)}$$

[Grädel et al., 2002]. In return for this substantially lower complexity,
the μ-calculus loses the succinctness of MSO. Fixpoint logics are indeed notorious
for being difficult to understand, even for reasonably expert people, as pointed out by
[Bradfield and Stirling, 2001]. However, it is assumed in this dissertation that
this is not a problem since the logic is only intended as a target for the compilation
of XML concepts. As such, the μ-calculus constitutes an interesting
alternative for studying MSO-related problems. From a theoretical perspective,
the AFMC with converse appears to be an appropriate logic for XML: it is
expressive enough to capture a significant class of XPath decision problems,
while offering an interesting balance between complexity and expressiveness.
The work found in [Tanabe et al., 2005] proposes a decision procedure for
the AFMC, whose time complexity is $2^{O(n \log n)}$. However, models of the logic
are Kripke structures (general infinite graphs), and the logic lacks the finite
model property (i.e. there exist formulas which are satisfiable on Kripke structures
and unsatisfiable on finite trees). In a preliminary work on XML type-checking,
a logic for finite trees was presented [Tozawa, 2004], but the logic is
not closed under negation.
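
As a small illustration of the gap between Kripke structures and finite trees, consider the following greatest fixpoint formula. The formula and the modality notation $\langle 1 \rangle$ (some successor through program 1) are illustrative assumptions and are not taken from the cited works.

```latex
\[
  \varphi \;=\; \nu X.\, \langle 1 \rangle X
\]
```

Read as "there is an infinite path of 1-successors", $\varphi$ is satisfied by a Kripke structure made of a single state with a self-loop, whereas no finite tree satisfies it, since every path in a finite tree eventually reaches a node without a 1-successor.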

Chapter 4 of this dissertation studies how the recent AFMC decision procedure
proposed in [Tanabe et al., 2005] can be used in the context of XML.
Based on the outcome of these investigations, the final Chapters 5 and 6 prove
the decidability of a new logic for finite trees, derived from the μ-calculus, in
time $2^{O(n)}$ and propose an effective algorithm for checking its satisfiability in
practice.

2.7 Systems for XML Type-Checking 

This section presents other related work on XML type-checking frameworks, 
which do not specifically aim at supporting XPath. Actually, none of the approaches
presented in this section is able to effectively deal with the expressive
power of the XPath fragment considered in this dissertation (and presented 
in Section 2.2.1). Nevertheless, this section gathers the main approaches and 
ideas developed elsewhere for static type-checking in the XML setting. Al- 
though notably different, several approaches can be seen as complementary to 
the work proposed in this dissertation. Most techniques are based on regular 
tree languages and use tree automata introduced in Section 2.1.4. 

2.7.1 Formulations of the Static Validation Problem 

The paper [Audebaud and Rose, 2000] was influential in clearly defining the
static validation problem. As an early attempt, it also proposes a set of typing
rules to establish relationships between the input and output type of an XSLT
transformation, but the method is only applicable to a tiny fragment of XSLT.
The XML type-checking problem was later described in [Suciu, 2002]. A more
recent survey of static type checkers for XML transformation languages
can be found in [Møller and Schwartzbach, 2005]. The remaining part
of this section presents the major known frameworks and innovations around 
the type-checking of XML. 

2.7.2 Inverse Type Inference with Tree Transducers 

The paper [Suciu, 2002] describes how static type-checking can be performed 
using forward type inference. Forward type inference refers to the ability to 
automatically deduce the output type of the XML document derived from the 
evaluation of an XML transformation. This is usually done by inference rules, 
and corresponding type inference algorithms are generally polynomial in the 
XML setting [Tozawa, 2001]. Type inference is used to do type-checking. For
instance, if a program is assumed to return a type $T_{out}$, then once the inferred
output type $T_{out}^{inf}$ is known, type-checking can be performed by testing the
inclusion $T_{out}^{inf} \subseteq T_{out}$. The work found in [Milo et al., 2003, Suciu, 2002] reveals
an important limitation of forward type inference in the context of XML:
forward type inference is not complete. This is because the output
type of a program may actually be a non-regular tree language that cannot
be inferred. In that case, the inferred regular type is typically a larger approximation
of the actual type, and the type-checker may reject a correct program
because $T_{out}^{inf} \not\subseteq T_{out}$ (an example and details on this limitation can be found in
[Suciu, 2002]).

The work found in [Milo et al., 2003] introduces the technique of inverse
type inference in an attempt to overcome this problem. Inverse type inference
computes the allowed input language for a so-called k-pebble transducer given
its output language. The resulting algorithm has non-elementary complexity. 
The paper [Martens and Neven, 2003] investigates how the expressive power of 
tree transducers must be further restricted in order to allow a polynomial time 
decision algorithm. The practical relevance and usability of techniques based 
on tree transducers have not yet been demonstrated. 

XSLT0 The paper [Tozawa, 2001] examines a fragment of XSLT called XSLT0
which covers the structural recursion core of XSLT. It relies on inverse type
inference to perform exact static validation, in the manner of [Milo et al., 2003]
but with a more efficient (exponential time) algorithm. However, XSLT0 does
not support XPath but only allows simple child steps in the recursion. Compiling
XSLT into XSLT0 is thus possible for only the simplest transformations.

2.7.3 XDuce, CDuce, Xtatic 

XDuce [Hosoya and Pierce, 2003] was the first domain-specific programming
language with type-checking of XML operations. The most essential part of
the type system is the subtyping relation, which is defined by inclusion of
the values represented by the types (this is also called structural subtyping*).
The proposed algorithm for subtyping attempts to avoid the worst-case exponential
time complexity in practical cases. Instead of relying on tree automata
determinization, it checks the inclusion relation by a top-down traversal
of the original type expressions. XDuce's algorithm builds on the previous
work found in [Aiken and Murphy, 1991], and extends it with several
implementation techniques. The resulting algorithm appears efficient in practice
[Hosoya and Pierce, 2003]. XDuce has provided the foundation for later
languages, in particular the CDuce [Benzaken et al., 2003, Frisch, 2004] and
Xtatic [Gapeyev and Pierce, 2003] languages. The CDuce language attempts
to extend XDuce towards being a general-purpose functional language. To
this end, CDuce provides a more sophisticated type system featuring function
types, intersection and negation types. It extends XDuce with higher-order
functions, variations of pattern matching primitives, and parametric polymorphism
[Hosoya et al., 2005a]. Xtatic aims at integrating the main ideas from
XDuce into C#. All these languages support pattern-matching through regular
expression types but not XPath. As pointed out in [Colazzo et al., 2004], a
major difference is that pattern-matching implements a one-match semantics,
i.e. every pattern, instead of collecting every matched piece of data (as in
standard query languages such as XPath), only binds the first match. Although
some recent work shows how to translate parts of XPath into Xtatic
[Gapeyev and Pierce, 2004], the XPath fragment considered includes neither
reverse axes nor negation in qualifiers.

* Structural subtyping is usually opposed to nominal subtyping in which type compatibility
and equivalence are not determined by the type's structure but through explicit declarations
and names of the types. See [Su et al., 2002] and [Simeon and Wadler, 2003] for
more details on subtyping paradigms.

2.7.4 Symbolic XML Schema Containment 

The work found in [Tozawa and Hagiya, 2003] proposes a symbolic algorithm,
based on binary decision diagrams [Bryant, 1986], in order to solve the containment
between two XML schemas. The algorithm appears to be efficient
in practice and compares favorably to the one used by XDuce. The idea of
using symbolic techniques is similar to the one used in the implementations presented
in Chapters 4 and 6. The implicit encoding of FTA presented
in [Tozawa and Hagiya, 2003] is however significantly simpler since it only considers
XML types (XML types only use a simple form of tree navigation; they
do not need upward nor multidirectional navigation in trees as XPath does).
Nevertheless, this work was the first to reveal the interest of using implicit
techniques in the context of XML. This work suggests and motivates further
developments such as simplifications for particular cases of the more general
symbolic techniques used in Chapters 4 and 6.

2.7.5 XJ 

The XJ [Harren et al., 2005] language aims at integrating XML processing
closely into Java. Types are regular expressions over XML Schema declarations.
The type system has two levels: regular expression operators and XML
Schema declarations. A peculiarity of XJ is that subtyping on the schema level
is nominal, i.e. type compatibility and containment are determined by explicit
declarations and the names of the types (as in Java). This aspect contrasts with
the structural subtyping systems used in XDuce (and in this dissertation). XJ
subtyping on the regular expression level is defined as regular language inclusion
on top of the schema subtyping. [Møller and Schwartzbach, 2005] argues
that an inherited drawback of the underlying nominal style of subtyping is that 
a given XML value may be tied too closely with its schema type, which thus 
makes certain transformations more complex than they could be. XJ neverthe- 
less provides an interesting experiment of integration of type-safe processing in 
Java, and a detailed study of nominal subtyping in the context of XML can be 
found in [Simeon and Wadler, 2003]. 

2.7.6 Approximated Approaches for XSLT 

Several approaches aim at proposing XSLT debugging features at compile-time 
by choosing to sacrifice exact decidability and to settle for pragmatic approximations
instead. Within this line of work, the paper [Dong and Bailey, 2004]
aims at conservatively analyzing the flow of an XSLT transformation. It uses
the control-flow information to detect unreachable templates and guarantee
termination. The analysis is however less precise than the more recent one
found in [Møller et al., 2005], which presents a more
complete approximate technique that is able to statically detect errors in
XSLT stylesheets. Their approach could certainly benefit from using the exact
algorithm proposed in Chapter 6 instead of their conservative approximation.

2.7.7 Path Correctness for μXQ Queries

The work found in [Colazzo et al., 2006] proposes a sound and complete type
system for ensuring path correctness for XML queries. The notion of navigation
correctness is similar to the emptiness problem formulated in Section 4.6
that can be used for detecting contradictions. The common idea is that if a
subexpression of a query always yields an empty result then this should be considered
as an error. The query language considered in [Colazzo et al., 2006],
called μXQ, covers a minimal core of XQuery [Boag et al., 2006] but ignores
reverse navigation. In comparison, the XPath fragment considered in this dissertation
includes all axes. The algorithm presented in Chapter 6 may provide
perspectives on how to extend the type system of [Colazzo et al., 2006] to deal
with reverse navigation.

2.8 The Spatial Logic Perspective 

Spatial logics are formalisms traditionally used for describing the behavior and 
spatial structure of concurrent systems. The main ingredient of spatial logics
is an operator called composition (or separation), which usually permits
reasoning over concurrent and mobile processes [Boneva and Talbot, 2005].
Spatial logics have recently been found useful in the study of semistructured
data and related query languages as they make it possible to express properties about
structures such as graphs [Cardelli et al., 2002, Dawar et al., 2004] and trees
[Cardelli and Ghelli, 2004].

The work found in [Cardelli and Ghelli, 2004] proposes the TQL logic as
the core of a query language for semistructured data represented as unranked
and unordered trees. The TQL logic is based on the ambient logic
[Cardelli and Gordon, 2000, Cardelli and Gordon, 2006]. It is known that TQL
is more expressive than MSO since it can express some counting properties
about trees that cannot be defined in MSO. It has been shown that a fragment
of the ambient logic contained in TQL is undecidable [Charatonik et al., 2003].
Nevertheless, decidable fragments of TQL could be useful for building type systems
for semistructured data such as the one proposed in [Calcagno et al., 2003],
and also for testing emptiness and containment of queries, as suggested in
[Cardelli and Ghelli, 2004]. TQL thus provides an interesting foundation for
further research.

The work found in [Boneva and Talbot, 2005] considers a fragment of TQL
called STL and characterizes its expressiveness. STL satisfiability is shown to be
undecidable, but some syntactic restrictions over STL formulas make it possible to capture
MSO.

The logic TL described in [Dal-Zilio et al., 2004] is also based on the ambient
logic. TL can be encoded into the so-called sheaves automata proposed in
[Dal-Zilio and Lugiez, 2003], whose transitions are conditioned by Presburger
formulas.

The major difference between these spatial logics and the work presented
in this dissertation is that spatial logics operate on unordered trees, whereas
this dissertation considers ordered trees (cf. Section 2.1.1) such as structured
documents. On the one hand, the extension of TQL's data model with ordering is
an interesting and important open issue [Conforti et al., 2002]. On the other
hand, extending the logic of ordered trees proposed in Chapters 5 and 6 of
this dissertation with counting constraints is also an interesting and promising
perspective. These research directions can thus be seen as complementary and
could certainly benefit from a reciprocal inspiration.

2.8.1 The Sheaves Logic 

The work found in [Dal-Zilio and Lugiez, 2006] introduces a modal logic for
documents called GDL, inspired by TQL, and proves the decidability of a
fragment of GDL called the Sheaves logic. The Sheaves logic (SL) operates on
ordered trees, and combines regularity and counting constraints. SL provides
an interleaving operator for dealing with mixed ordered and unordered content.
On the one hand, SL lacks recursion, i.e. fixpoint operators, which are needed for
supporting query languages (cf. Chapter 4); on the other hand, SL makes it possible to
reason about numerical properties of the contents of elements, and may provide
the inspiration for the integration of counting constraints in the logic presented
in Chapter 5, which is left for future work.
Chapter 3

Monadic Second-Order Logic for XML



3.1 Introduction 

This chapter first investigates how MSO can be used in the context of XML,
despite its non-elementary complexity*. A sound and complete decision procedure
for containment of XPath queries is proposed based on MSO. Specifically,
XPath queries are translated into equivalent formulas in WS2S introduced in 
Section 2.5.2. Using this translation, the logical formulation of the contain- 
ment problem is constructed, and optimized, by taking into account XPath 
peculiarities. The containment formula is then decided using tree automata. 
When the containment relation does not hold between two XPath expressions, 
a counter-example XML tree is generated. A complexity analysis is provided, 
along with practical experiments. 

Chapter Outline Section 3.2 presents the encoding of XML trees into WS2S. 
Section 3.3 explains the translation of XPath queries to logical formulas. A 
complexity analysis and an optimization method are given in Section 3.4. Ex- 
perimental results and the outcome of this approach are respectively discussed 
in Sections 3.5 and 3.6. 

3.2 Representation of XML Trees 

Section 2.5.1 presented how characteristic sets can be used for describing shapes. 
A shape is basically a second order variable, interpreted as a set of nodes, for 
which particular properties hold. Using WS2S, this section now expresses ad- 
ditional requirements that a shape should fulfill in order to be an XML tree. 

* It is well known that type inference for higher-order typed lambda calculi can have
non-elementary complexity, and is nevertheless effectively used by typed functional programming
languages such as those of the ML family [Henglein and Mairson, 1991].

The first requirements are structural. First, in order to be a tree, a shape
X must be prefix-closed, that is, for any position in the tree, any prefix of this
position is also in the tree:

$$\mathrm{PrefixClosed}(X) \stackrel{\mathrm{def}}{=} \forall x.\forall y.\,((y = x.1 \vee y = x.0) \wedge y \in X) \Rightarrow x \in X$$

This ensures the shape is fully connected. Second, a predicate for the root of
X is defined:

$$\mathrm{IsRoot}(X, x) \stackrel{\mathrm{def}}{=} x \in X \wedge \neg(\exists z.\, z \in X \wedge (x = z.1 \vee x = z.0))$$

In order to be a tree and not a hedge, X must have only one root with no
sibling:

$$\mathrm{SingleRoot}(X) \stackrel{\mathrm{def}}{=} \forall x.\, \mathrm{IsRoot}(X, x) \Rightarrow x.1 \notin X$$

Then, the labeling of the tree must be consistent with XML. The same symbol
may appear at several locations in the tree with different arities: either as a
binary constructor or as a leaf. However, one and only one symbol is associated
with a position in the shape. Assume that the set of characteristic sets forms
a partition:

$$\mathrm{Partition}(X, X_1, \ldots, X_n) \stackrel{\mathrm{def}}{=} X = \bigcup_{i=1}^{n} X_i \wedge \mathrm{Disjoint}(X_1, \ldots, X_n)$$
$$\mathrm{Disjoint}(X_1, \ldots, X_n) \stackrel{\mathrm{def}}{=} \bigwedge_{i \neq j} X_i \cap X_j = \emptyset$$

This prevents a node from having multiple labels, but it also prevents a tree from being
labeled using an infinite alphabet. The problem comes from declaring
$X = \bigcup_{i=1}^{n} X_i$, which prevents any other symbol from occurring in the tree. Consider instead
that the characteristic sets must only be disjoint: a position in the tree may then not
be a member of any of the considered characteristic sets. That is how labeling
from an infinite alphabet is emulated. As a result, an XML tree is encoded in
the following way:

$$\mathrm{XMLTree}(X, X_1, \ldots, X_n) \stackrel{\mathrm{def}}{=} \mathrm{PrefixClosed}(X) \wedge \mathrm{SingleRoot}(X) \wedge \mathrm{Disjoint}(X_1, \ldots, X_n) \wedge X \neq \emptyset$$

where X is the tree (non-empty in order not to get degenerate results) and the
X_i are the characteristic sets. Figure 3.1 shows how this is formulated in
MONA syntax [Klarlund and Møller, 2001], for the case of two characteristic
sets of interest named Xbook and Xcitation. The only difference is that the
shape X is declared as a global free variable named $ together with associated
restrictions, instead of being passed as a parameter to predicates. In MONA
syntax, "var2" is the keyword for declaring a free second-order variable; "all1"
is the universal quantifier for first-order variables; and "&" and "|" respectively
stand for the "∧" and "∨" connectives.
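
The predicates of this section can also be checked directly on a concrete finite shape. The following sketch assumes that a position is written as a string over the digits 0 (first child) and 1 (next sibling), with the empty string as the root, and that a shape is a Python set of such strings; the function names simply mirror the predicates and are otherwise hypothetical.

```python
from itertools import combinations


def prefix_closed(X):
    """Every non-root position has its immediate prefix in X."""
    return all(y[:-1] in X for y in X if y)


def is_root(X, x):
    """x is in X and is the 0- or 1-successor of no position of X."""
    return x in X and (x == "" or x[:-1] not in X)


def single_root(X):
    """No root has a next sibling (the shape is a tree, not a hedge)."""
    return all(x + "1" not in X for x in X if is_root(X, x))


def disjoint(*char_sets):
    """No position carries two of the considered labels."""
    return all(not (a & b) for a, b in combinations(char_sets, 2))


def xml_tree(X, *char_sets):
    return (bool(X) and prefix_closed(X) and single_root(X)
            and disjoint(*char_sets))


# <book><citation/><section/></book> in the binary encoding:
# "" is book, "0" its first child, "01" the next sibling of "0".
shape = {"", "0", "01"}
x_book, x_citation = {""}, {"0"}
ok = xml_tree(shape, x_book, x_citation)
```

The position "01" (the section element) belongs to neither characteristic set, which is exactly how a label outside the considered symbols is emulated.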



3.3 Interpretation of XPath Queries 

This section explains how an XPath expression can be translated into an equiva- 
lent WS2S formula. This logical interpretation basically consists in considering 
a query as a relation that connects two tree nodes: a context node from which 
the query is applied, and a result node (selected by the query). 
ws2s;

# Data Model
var2 $ where ~empty($)
  & (all1 x : all1 y : ((y=x.1 | y=x.0)
                        & (y in $)) => x in $)
  & all1 r : (r in $ & ~(ex1 z : z in $
                         & (r=z.1 | r=z.0)))
             => r.1 notin $;

# Characteristic sets
var2 Xbook, Xcitation;

# Partition
((all1 x : x in Xbook => x notin Xcitation)
 & (all1 x : x in Xcitation => x notin Xbook));



Figure 3.1: Sample XML Tree in MONA WS2S Syntax. 



3.3.1 Navigation and Recursion 

As a first step toward a WS2S encoding of XPath expressions, the navigational
primitives over binary trees must be expressed. Considering binary trees involves
recursion for modeling the usual child relation on unranked trees (cf.
Figure 2.1 and the isomorphism between binary and unranked trees detailed
in Section 2.1.1). Recursion is not available as a basic construct of WS2S, but
it can be defined via a transitive closure formulated using second-order
quantification.

The following-sibling relation is first expressed in WS2S. Consider a second-order
variable F as the set of nodes of interest. The following-sibling relation is
defined as an induction scheme. The base case just captures that the immediate
right successor of x is its first following sibling:

$$(x.1 \in F)$$

Then the induction step states that the immediate right successor of every
position in F is also among the following siblings, and formulates this as a
transitive closure:

$$\forall z.(z \in F \Rightarrow z.1 \in F)$$

The global requirement for a node y to be one of the following siblings of x is
now formulated. The node y must belong to the set F which is closed under
the following-sibling relation starting from x.1:

$$(x.1 \in F \wedge \forall z.\, z \in F \Rightarrow z.1 \in F) \Rightarrow y \in F$$

Note that this formula is satisfied for multiple sets F. For instance, the set of all
tree nodes satisfies this implication. Actually, only the smallest set F for which
the formula holds is of interest: the set which contains all and only the following
siblings. A way to express this is to introduce a universal quantification over
F. Indeed, ranging over all such sets of nodes notably takes into account the
particular case where F is minimal, i.e. the set of interest. If the global formula
holds for every F, y is also in the minimal set that contains only the following
siblings of x. Therefore, the XPath "following-sibling" axis is defined as the
WS2S predicate:

$$\mathrm{followingsibling}(X, x, y) \stackrel{\mathrm{def}}{=} \forall F.\, F \subseteq X \Rightarrow ((x.1 \in F \wedge \forall z.\, z \in F \Rightarrow z.1 \in F) \Rightarrow y \in F)$$

that expresses the requirements for a node y to be a following sibling of a node
x in the tree X. The XPath "descendant" axis can be modeled in the same manner.
The set D of interest is initialized with the left child of the context node, and
is closed under both successor relations:

$$\mathrm{descendant}(X, x, y) \stackrel{\mathrm{def}}{=} \forall D.\, D \subseteq X \Rightarrow ((x.0 \in D \wedge \forall z.(z \in D \Rightarrow z.1 \in D \wedge z.0 \in D)) \Rightarrow y \in D)$$

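Considering these two relations as navigational primitives, more complex ones
can be built out of them:

$$\begin{aligned}
\mathrm{child}(X, x, y) &\stackrel{\mathrm{def}}{=} y = x.0 \vee \mathrm{followingsibling}(X, x.0, y) \\
\mathrm{following}(X, x, y) &\stackrel{\mathrm{def}}{=} \exists z.\, z \in X \wedge z.1 \in X \wedge \mathrm{ancestor}(X, x, z) \wedge \mathrm{descendant}(X, z.1, y) \\
\mathrm{self}(X, x, y) &\stackrel{\mathrm{def}}{=} x = y \\
\mathrm{descendantorself}(X, x, y) &\stackrel{\mathrm{def}}{=} \mathrm{self}(X, x, y) \vee \mathrm{descendant}(X, x, y)
\end{aligned}$$

Finally, the other XPath axes are defined as syntactic sugar by taking
advantage of XPath symmetry:

$$\begin{aligned}
\mathrm{ancestor}(X, x, y) &\stackrel{\mathrm{def}}{=} \mathrm{descendant}(X, y, x) \\
\mathrm{parent}(X, x, y) &\stackrel{\mathrm{def}}{=} \mathrm{child}(X, y, x) \\
\mathrm{precedingsibling}(X, x, y) &\stackrel{\mathrm{def}}{=} \mathrm{followingsibling}(X, y, x) \\
\mathrm{ancestororself}(X, x, y) &\stackrel{\mathrm{def}}{=} \mathrm{descendantorself}(X, y, x) \\
\mathrm{preceding}(X, x, y) &\stackrel{\mathrm{def}}{=} \mathrm{following}(X, y, x)
\end{aligned}$$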
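
The "smallest closed set" reading of these axes can be sketched operationally. The code below is an illustration, not the WS2S encoding itself: it assumes positions are strings over 0 (first child) and 1 (next sibling), that a tree X is a finite set of such strings, and that successor positions falling outside X are simply ignored. The helper `closure` and the set-returning style are assumptions made for the example.

```python
def closure(X, seeds, successors):
    """Smallest subset of X containing the seeds and closed under successors."""
    result = set()
    todo = [s for s in seeds if s in X]
    while todo:
        z = todo.pop()
        if z not in result:
            result.add(z)
            todo.extend(z + d for d in successors if z + d in X)
    return result


def following_sibling(X, x):
    return closure(X, [x + "1"], "1")      # start at x.1, close under .1


def descendant(X, x):
    return closure(X, [x + "0"], "01")     # start at x.0, close under .0 and .1


def child(X, x):
    first = x + "0"
    return ({first} & X) | following_sibling(X, first)


def parent(X, y):
    return {x for x in X if y in child(X, x)}        # converse of child


def ancestor(X, y):
    return {x for x in X if y in descendant(X, x)}   # converse of descendant


# book(citation, section(citation)) in the binary encoding
tree = {"", "0", "01", "010"}
```

On this tree, `child(tree, "")` yields the two children of the root, and the converse axes are obtained by swapping the roles of the two nodes, as in the definitions above.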

3.3.2 Logical Composition of Steps 

This section describes how path composition operators are translated into
logical connectives. The translation is formally specified as a "derivor" shown
in Figure 3.2 and written $W_e[\![e]\!]_x^y$ where:

• the parameter e (surrounded by special "syntax" braces $[\![\;]\!]$) is the source
language parameter that is rewritten;

• the additional parameters x and y are respectively the context and the
result node of the query.
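$$\begin{aligned}
W_e[\![\cdot]\!]_\cdot^\cdot &: \mathit{Expression} \to \mathit{Node} \to \mathit{Node} \to \mathcal{L}_{WS2S} \\
W_e[\![/p]\!]_x^y &= \exists z.\, \mathrm{isroot}(z) \wedge W_p[\![p]\!]_z^y \\
W_e[\![e_1 \mid e_2]\!]_x^y &= W_e[\![e_1]\!]_x^y \vee W_e[\![e_2]\!]_x^y \\[1ex]
W_p[\![\cdot]\!]_\cdot^\cdot &: \mathit{Path} \to \mathit{Node} \to \mathit{Node} \to \mathcal{L}_{WS2S} \\
W_p[\![p[q]]\!]_x^y &= W_p[\![p]\!]_x^y \wedge W_q[\![q]\!]_y \\[1ex]
W_q[\![\cdot]\!]_\cdot &: \mathit{Qualifier} \to \mathit{Node} \to \mathcal{L}_{WS2S} \\
W_q[\![q_1 \text{ and } q_2]\!]_x &= W_q[\![q_1]\!]_x \wedge W_q[\![q_2]\!]_x \\
W_q[\![q_1 \text{ or } q_2]\!]_x &= W_q[\![q_1]\!]_x \vee W_q[\![q_2]\!]_x \\
W_q[\![\text{not } q]\!]_x &= \neg W_q[\![q]\!]_x
\end{aligned}$$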



Figure 3.2: Translating XPath into WS2S. 



The compilation of an XPath expression to WS2S relies on $W_p$, in charge
of translating paths into formulas, and the dual derivor $W_q$ for translating
qualifiers into formulas. The basic principle is that $W_p[\![p]\!]_x^y$ holds for all pairs
x, y of nodes such that y is accessed from x through the path p. Similarly,
$W_q[\![q]\!]_x$ holds for all nodes x such that the qualifier q is satisfied from the
context node x.

The interpretation of path composition $W_p[\![p_1/p_2]\!]_x^y$ consists in checking the
existence of an intermediate node that connects the two paths, and therefore
requires a new fresh variable to be inserted. The same holds for $W_e[\![/p]\!]_x^y$ that
restarts from the root to interpret p, whatever the current context node x is.

Paths can occur inside qualifiers, therefore $W_e$, $W_p$ and $W_q$ are mutually
recursive. Since the interpretations of paths and qualifiers are respectively
dyadic and monadic formulas, the translation of a path inside a qualifier $W_q[\![p]\!]_x$
requires the insertion of a new fresh variable whose only purpose consists in
testing the existence of the path.

Finally, the translation of steps relies on the logical definition of axes:
a(x, y) denotes the WS2S predicate defining the XPath axis a, as described in
Section 3.3.1. For instance, Figure 3.3 presents the WS2S translation of the
# Translated XPath expression:
# child::book/descendant::citation[parent::section]

ws2s;

# Data Model
var2 $ where ~empty($)
  & (all1 x : all1 y : ((y=x.1 | y=x.0)
                        & (y in $)) => x in $)
  & all1 r : (r in $ & ~(ex1 z : z in $
                         & (r=z.1 | r=z.0)))
             => r.1 notin $;

# Characteristic sets
var2 Xbook, Xcitation, Xsection;

# Partition
((all1 x: x in Xbook => x notin Xcitation
                        & x notin Xsection) &
 (all1 x: x in Xcitation => x notin Xbook
                        & x notin Xsection) &
 (all1 x: x in Xsection => x notin Xbook
                        & x notin Xcitation));

# Query (parameters are context and result nodes)
pred xpath1 (var1 x, var1 y) =
  ex1 x1 : child(x,x1) & x1 in Xbook
  & descendant(x1,y) & y in Xcitation
  & ex1 x2 : parent(y,x2) & x2 in Xsection;





Figure 3.3: WS2S Translation of a Sample XPath in MONA Syntax. 



XPath expression: 

child: :book/descendant: :citation[parent: :section] 



3.3.3 Formulating XPath Containment 

The XPath containment problem can now be expressed in terms of a logical
formula. Given two XPath expressions $e_1$ and $e_2$, the WS2S formula
corresponding to checking their containment is built in two steps. First, each
XPath expression is translated into a WS2S logical relation that connects two
nodes in the tree, as presented in Section 3.3.2. Then the data model is
unified. Each translation yields a set of characteristic sets. The union of them
is built, so that characteristic sets that correspond to symbols used in both
expressions are identified.

From a logical point of view, $e_1 \subseteq e_2$ means that each pair of nodes (x, y)
such that x and y are connected by the logical relation corresponding to $e_1$ is
similarly connected by the logical relation obtained from $e_2$:

$\forall x.\ \forall y.\ W_e\llbracket e_1 \rrbracket_x^y \Rightarrow W_e\llbracket e_2 \rrbracket_x^y$    (3.1)

The containment relation holds between expressions $e_1$ and $e_2$ if and only if
the WS2S formula (3.1) is satisfied for all trees. With respect to the notations
of Section 3.2, the containment between expressions $e_1$ and $e_2$ is thus
formulated as:

$\forall X.\ \mathit{XMLTree}(X, X_1, \ldots, X_n) \Rightarrow (\forall x \in X.\ \forall y \in X.\ W_e\llbracket e_1 \rrbracket_x^y \Rightarrow W_e\llbracket e_2 \rrbracket_x^y)$

where the $X_i$ are members of the union of all characteristic sets detected for
each expression. Consider for instance the two XPath expressions:

$e_1$ = child::book/descendant::citation[parent::section]

$e_2$ = descendant::citation[ancestor::book and ancestor::section]

Figure 3.4 presents the generated WS2S formula for checking containment
between $e_1$ and $e_2$, in MONA syntax. The formula is determined valid (which
means $e_1 \subseteq e_2$) in less than 0.2 seconds, the time spent to build the
corresponding automaton and analyze it. The formula for the reciprocal
containment check between $e_2$ and $e_1$ is satisfiable, which means
$e_2 \not\subseteq e_1$. The total running time of the decision procedure is less than
0.9 seconds, including the generation of the counter-example, shown below:

<book>
  <section>
    <other>
      <citation/>
    </other>
  </section>
</book>
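The role of this counter-example can be checked by hand with a small Python
sketch that evaluates both expressions directly on the tree. The Node class and
the two hard-coded functions e1 and e2 are assumptions of the example (they are
not produced by the solver), and the book element is taken as the context node.

```python
class Node:
    """Unranked tree node with a label, ordered children and a parent link."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)

def ancestors(node):
    while node.parent is not None:
        node = node.parent
        yield node

def e1(x):
    """child::book/descendant::citation[parent::section] from context x."""
    return {y
            for b in x.children if b.label == "book"
            for y in descendants(b)
            if y.label == "citation" and y.parent.label == "section"}

def e2(x):
    """descendant::citation[ancestor::book and ancestor::section] from context x."""
    return {y for y in descendants(x)
            if y.label == "citation"
            and any(a.label == "book" for a in ancestors(y))
            and any(a.label == "section" for a in ancestors(y))}

# The counter-example tree: book / section / other / citation.
citation = Node("citation")
book = Node("book", [Node("section", [Node("other", [citation])])])

print(e2(book) <= e1(book))  # is e2 contained in e1 on this tree and context?
```

From the book node, the second expression selects the citation element while the
first selects nothing, so the pair (book, citation) witnesses that the second
expression is not contained in the first.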

3.3.4 Soundness and Completeness 

Soundness and completeness of the decision procedure for XPath containment
are ensured by construction. Indeed, consider the initial definition of the
containment problem: given an XML tree, checking containment between two
XPath expressions $e_1$ and $e_2$ consists in determining whether the following
proposition holds:

$\forall x,\ S_e\llbracket e_1 \rrbracket_x \subseteq S_e\llbracket e_2 \rrbracket_x$    (3.2)

By definition, (3.2) is logically equivalent to:

$\forall x, \forall y,\ y \in S_e\llbracket e_1 \rrbracket_x \Rightarrow y \in S_e\llbracket e_2 \rrbracket_x$    (3.3)

Then the last step remaining to prove is the equivalence between (3.3) and
(3.1). To this end, the compilation of XPath expressions into WS2S formulas
must preserve XPath denotational semantics, which means:

Theorem 3.3.1 The logical translation of XPath expressions is equivalent to
XPath denotational semantics:

$W_e\llbracket e \rrbracket_x^y \Leftrightarrow y \in S_e\llbracket e \rrbracket_x$    (3.4)

ws2s;

# Checking XPath Containment between
# 'child::book/descendant::citation[parent::section]'
# and 'descendant::citation[ancestor::book
# and ancestor::section]'

# Data Model
var2 $ where ~empty($)
  & (all1 x : all1 y : ((y=x.1 | y=x.0)
      & (y in $)) => x in $)
  & all1 r : (r in $ & ~(ex1 z : z in $
      & (r=z.1 | r=z.0)))
    => r.1 notin $;

# Characteristic sets
var2 Xbook, Xcitation, Xsection;

# Queries (parameters are context and result nodes)
pred xpath1 (var1 x, var1 y) =
  ex1 x1 : child(x,x1) & x1 in Xbook
  & descendant(x1,y) & y in Xcitation
  & ex1 x2 : parent(y,x2) & x2 in Xsection;
pred xpath2 (var1 x, var1 y) =
  descendant(x,y) & y in Xcitation
  & ex1 x1 : (ancestor(y,x1) & x1 in Xbook)
  & ex1 x2 : (ancestor(y,x2) & x2 in Xsection);

# Problem formulation
((all1 x: x in Xbook => x notin Xcitation
    & x notin Xsection) &
 (all1 x: x in Xcitation => x notin Xbook
    & x notin Xsection) &
 (all1 x: x in Xsection => x notin Xbook
    & x notin Xcitation))
=>
(all1 x: all1 y: (xpath1(x,y) => xpath2(x,y)));


Figure 3.4: Sample WS2S Formula for XPath Containment in MONA Syntax.

Proof (Sketch) The proof uses an induction over the structure of paths.
Since the definition of paths and qualifiers is cross-recursive, a mutual
induction scheme is used. The scheme relies on the dual property for qualifiers
that also needs to be proved:

$\forall q, \forall x,\ (S_q\llbracket q \rrbracket_x \Leftrightarrow W_q\llbracket q \rrbracket_x)$    (3.5)

Specifically, (3.4) is proved by taking (3.5) as an assumption, and reciprocally
(3.5) is proved under (3.4) as an assumption. Both equivalences (3.4) and (3.5)
are proved inductively for each compositional layer. The idea basically consists
in associating corresponding logical connectives to each set-theoretic
composition operator used in the denotational semantics. XPath qualifier
constructs trivially correspond to logical WS2S connectives. Path constructs
involve set-theoretic union and intersection operations, which are respectively
mapped to logical disjunction and conjunction. Two path constructs, $p_1/p_2$
and $p[q]$, require specific attention in the sense that their denotational
semantics introduce particular compositions over sets of nodes. They are
recalled below:

$S_p\llbracket p_1/p_2 \rrbracket_x = \{x_2 \mid x_1 \in S_p\llbracket p_1 \rrbracket_x \wedge x_2 \in S_p\llbracket p_2 \rrbracket_{x_1}\}$

$S_p\llbracket p[q] \rrbracket_x = \{x_1 \mid x_1 \in S_p\llbracket p \rrbracket_x \wedge S_q\llbracket q \rrbracket_{x_1}\}$

Auxiliary lemmas are introduced in order to clarify how these constructs are
mapped to WS2S. The XPath construct $p_1/p_2$ is generalized as a function
product(), whereas the XPath construct $p[q]$ is generalized as filter():

$\mathit{product} : \mathit{Set}(\mathit{Node}) \to (\mathit{Node} \to \mathit{Set}(\mathit{Node})) \to \mathit{Set}(\mathit{Node})$
$\mathit{filter} : \mathit{Set}(\mathit{Node}) \to (\mathit{Node} \to \mathit{Boolean}) \to \mathit{Set}(\mathit{Node})$

product() is characterized by the lemmas (3.6) and (3.7), in which y and z are
nodes, and S is a set of nodes. These lemmas abstract over XPath navigational
functionalities performed by axes by letting f denote a function that returns
a set of nodes given a current node:

$\forall y, \forall z, \forall S, \forall f : \mathit{Node} \to \mathit{Set}(\mathit{Node}),\ z \in S \Rightarrow y \in (f\ z) \Rightarrow y \in \mathit{product}(S, f)$    (3.6)

$\forall y, \forall S, \forall f : \mathit{Node} \to \mathit{Set}(\mathit{Node}),\ y \in \mathit{product}(S, f) \Rightarrow \exists z,\ z \in S \wedge y \in (f\ z)$    (3.7)

The function filter() is in turn characterized by the following lemma:

$\forall y, \forall g : \mathit{Node} \to \mathit{Boolean},\ y \in \mathit{filter}(S, g) \Rightarrow y \in S$    (3.8)

The auxiliary lemmas (3.6), (3.7), and (3.8) are also proved by induction.
Developing the proof in constructive logic involves the (trivial) decidability
of set-theoretic inclusion and of the denotational semantics of qualifiers. The
full formal proof is detailed in [Genevès and Vion-Dury, 2004]. It has been
mechanically checked using the Coq formal proof management system
[Huet et al., 2004].
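To make the two auxiliary functions concrete, the following Python sketch gives
one possible executable reading of product() and filter() over finite sets, and
checks the properties stated by lemmas (3.6) to (3.8) on a toy instance. The
names filter_nodes, lemma_3_6, lemma_3_7 and lemma_3_8, as well as the use of
integers as stand-ins for nodes, are assumptions of the example; the checks are
simple tests, not a substitute for the mechanized proof.

```python
def product(s, f):
    """Generalizes p1/p2: collect f(z) for every node z of s."""
    return {y for z in s for y in f(z)}

def filter_nodes(s, g):
    """Generalizes p[q]: keep the nodes of s on which the predicate g holds."""
    return {y for y in s if g(y)}

def lemma_3_6(s, f):
    # z in S and y in (f z) imply y in product(S, f)
    return all(y in product(s, f) for z in s for y in f(z))

def lemma_3_7(s, f):
    # y in product(S, f) implies that some z in S satisfies y in (f z)
    return all(any(y in f(z) for z in s) for y in product(s, f))

def lemma_3_8(s, g):
    # y in filter(S, g) implies y in S
    return filter_nodes(s, g) <= s

# Toy instance: nodes are integers, and f maps a node to its two "children".
nodes = {1, 2}
children = lambda z: {10 * z, 10 * z + 1}
is_even = lambda y: y % 2 == 0

print(sorted(product(nodes, children)))
print(sorted(filter_nodes(product(nodes, children), is_even)))
```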

3.4 Complexity Analysis and Optimization 

The translation of an XPath query to its logical representation is linear in the
size of the input query. Indeed, each expression is decomposed then translated
inductively in one pass without any duplication, as shown by the formal
definition of $W_e$ in Section 3.3.2.

The second step is the decision procedure, which, compared to the translation,
represents the major part of the cost. The truth status of a WS2S formula is
decided through the logic-automaton connection as described in Sections 2.5.4
and 2.5.5 of Chapter 2. This translation from logical formulas to tree
automata, while effective, is unfortunately non-elementary. This bound may
sound discouraging. Fortunately, the worst-case scenario, which corresponds to
complex formulas, is not likely to occur for small instances of the containment
problem in practice. Furthermore, recent works on MSO solvers, especially those
using BDD techniques [Bryant, 1986] such as MONA [Klarlund and Møller, 2001],
suggest that in particular practical cases the explosiveness of this technique
can be effectively controlled.

In practice, the implementation relies on MONA [Klarlund and Møller, 2001],
which implements the WS2S decision procedure along with various optimizations.
Additionally, a significant optimization that takes advantage of XPath
peculiarities for combating automaton size explosion is described in the
following subsection.

3.4.1 Optimization Based on Guided Tree Automata 

A major source of complexity arises from the translation of composed paths.
Each translation of the form $W_p\llbracket p_1/p_2 \rrbracket_x^y$ introduces an existentially
quantified first-order variable which ranges over all possible tree positions
(cf. Figure 3.5).

The idea in this section is to take advantage of XPath navigational
peculiarities for attempting to reduce the scope associated with such variables.
XPath navigates the tree step by step: each step selects a set of nodes which is
in turn used to select a new one by the next step. The interpretation of a
variable inserted during the translation of $p_1/p_2$ corresponds to the
intermediate node which is a result of $p_1$ and the context node of $p_2$. The
truth status of the formula is determined by the existence of such an
intermediate node at a particular position in the tree. If one can distinguish
regions in the tree in which such a node may appear from those where it cannot
appear, valuable positional knowledge is gained that can be used to reduce the
variable scope. It is interesting to try to identify the region in the tree (or
even some larger approximation) in which the node must be located in order for
the formula to be satisfied. The sequential structure of XPath steps makes it
possible to exploit such positional knowledge. Indeed, consider for instance the
expression:

$e_3$ = /child::book/descendant::*[child::citation]

$e_3$ navigates from the document root through its "book" children elements and
then selects all descendant nodes provided they have at least one child named
"citation". Several conditions must be satisfied by a tree $t_1$ in order to
yield a result for $e_3$:

• $t_1$ must have at least one "book" element as a child of the root;

• $t_1$ must have at least one element that must be a descendant of the "book"
element;

• for this node to be selected it must have at least one child named
"citation".

e1(x,y) = ex1 x1 : isroot(x1) & x1 in $
  & ex1 x2 : child(x1,x2) & x2 in Xbook
  & descendant(x2,y) & y in $
  & ex1 x3 : child(y,x3) & x3 in Xcitation;

Figure 3.5: WS2S Translation of $e_3$ in MONA Syntax.

Figure 3.6: Depth Levels in the Unranked and Binary Cases.



This is made explicit by the logical translation $W_e\llbracket e_3 \rrbracket_x^y$ in MONA syntax
shown in Figure 3.5. In this translation, x1, x2 and x3 denote the respective
positions of the root node, a "book" child, and a "citation" child of the
selected position y. These variables actually only range over a particular set
of positions in the tree. By definition, the root can only appear at depth level
0, the "book" element can only occur at level 1 and its descendants occur at any
depth level l greater than or equal to 2. Finally, the "citation" element should
occur at level l + 1. This is because each step introduces its particular
positional constraint, which can be propagated to the next steps.

The idea of taking advantage of positional knowledge is even more general.
Theoretically, normal bottom-up FTA are sufficient for deciding validity of a
WS2S formula (as presented in Section 2.5.4 of Chapter 2). However, the
composition of such automata is particularly sensitive to state space explosion,
as presented in Section 2.5.6. Guided tree automata (GTA) [Biehl et al., 1997]
have been introduced in order to combat such state space explosion by following
the divide and conquer approach. A GTA is just an ordinary FTA equipped
with an additional deterministic top-down tree automaton called the guide.
The latter is introduced to take advantage of positional knowledge, and used
for partitioning the FTA state space into independent subspaces. Top-down
deterministic automata are strictly less powerful than ordinary (bottom-up or
non-deterministic top-down) FTA [Comon et al., 1997]. However, this is not a
problem since the guide is only intended to provide additional auxiliary
information used for optimization purposes. As a consequence, the more precise
the guide, the more efficient the decision procedure, but an approximation is
sufficient. The guide basically splits the state space of the FTA into
independent subsets. Therefore the transition relation of the bottom-up
automaton is split into a family of transition functions, one for each state
space name. A state space name corresponds to a particular depth level or a set
of depth levels. GTA can be composed in the same way as ordinary FTA, as
explained in Section 2.5.4 of Chapter 2. A GTA can be seen as an ordinary tree
automaton, where the state space has been factorized according to the guide. A
GTA with only one state space is just an ordinary tree automaton. A detailed
description of GTA can be found in [Biehl et al., 1997]. GTA-based optimization
may lead to exponential improvements of the decision procedure
[Elgaard et al., 2000].

A tree partitioning based on the depth levels is now introduced. It is depicted
in Figure 3.6 for an n-ary sample tree and its binary counterpart. Based
on this partitioning, a positional constraint (a restricted set of depth levels)
is associated with each node variable. Indeed, a node referred to by an XPath
expression can occur at several depth levels since some axes involve transitive
closure (cf. Section 2.2.2 of Chapter 2). Moreover, the set of depth levels can
even be infinite since XPath offers recursion in unbounded trees.

Sets of depth levels are computed by the function shown in Figure 3.7, written
$L_e\llbracket e \rrbracket_N$, where e is the XPath expression to be analyzed and N is the
set of positional constraints corresponding to the context node from which e is
applied. Again, the algorithm proceeds inductively on the structure of XPath
expressions. XPath steps are base cases for which the set of levels is
effectively calculated from the previous one. Transitive closure axes such as
"descendant" turn the set of depth levels into an infinite one, even if the
previous one was finite. Path composition basically propagates the level
calculations by combining them with the base cases. Note that significant
precision can be gained with absolute XPath expressions. In this case, the
initial set of depth levels is the singleton {0}, as opposed to relative XPath
expressions for which the context node is not known and the initial set of
depth levels is consequently $\mathbb{N}$.
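As an illustration of this propagation, the following Python sketch computes
sets of depth levels for a few axes. It is only a sketch under explicit
assumptions: possibly infinite sets are represented as a finite set plus an
optional lower bound above which all levels belong to the set, and the rules
chosen for the "self", "child", "parent" and "descendant" axes are this
example's own reading of the text, not a transcription of Figure 3.7.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class Levels:
    """Depth levels: `finite`, plus every level >= `tail` when `tail` is set."""
    finite: FrozenSet[int] = frozenset()
    tail: Optional[int] = None

def make(finite=(), tail=None):
    """Build a normalized Levels value so that equal sets compare equal."""
    if tail is not None:
        tail = max(tail, 0)
    kept = {k for k in finite if k >= 0 and (tail is None or k < tail)}
    while tail is not None and tail - 1 in kept:  # absorb levels next to the tail
        tail -= 1
        kept.discard(tail)
    return Levels(frozenset(kept), tail)

ALL_LEVELS = make(tail=0)  # unknown context node: any depth level

def minimum(n):
    candidates = set(n.finite) | ({n.tail} if n.tail is not None else set())
    return min(candidates) if candidates else None

def union(a, b):
    tails = [t for t in (a.tail, b.tail) if t is not None]
    return make(a.finite | b.finite, min(tails) if tails else None)

def step(axis, n):
    if axis == "self":
        return n
    if axis == "child":
        return make({k + 1 for k in n.finite}, None if n.tail is None else n.tail + 1)
    if axis == "parent":
        return make({k - 1 for k in n.finite}, None if n.tail is None else n.tail - 1)
    if axis == "descendant":  # transitive closure: unbounded from min + 1
        low = minimum(n)
        return make() if low is None else make(tail=low + 1)
    raise ValueError(f"axis not covered by this sketch: {axis}")

def levels_path(p, n):
    kind = p[0]
    if kind == "step":       # ("step", axis, label)
        return step(p[1], n)
    if kind == "seq":        # p1/p2: levels of p1 become the context of p2
        return levels_path(p[2], levels_path(p[1], n))
    if kind == "filter":     # p[q]: a qualifier does not move the selection
        return levels_path(p[1], n)
    raise ValueError(kind)

def levels_expr(e, n):
    if e[0] == "abs":        # /p restarts from the root, at level 0
        return levels_path(e[1], make({0}))
    if e[0] == "union":
        return union(levels_expr(e[1], n), levels_expr(e[2], n))
    return levels_path(e, n)

e3 = ("abs", ("seq", ("step", "child", "book"),
              ("filter", ("step", "descendant", "*"),
               ("path", ("step", "child", "citation")))))
print(levels_expr(e3, ALL_LEVELS))
```

For the absolute expression /child::book/descendant::*[child::citation], the
sketch yields all levels from 2 upwards for the selected node, which agrees with
the informal analysis given earlier for this expression.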

The optimized compilation of XPath expressions to WS2S formulas is given
in Figure 3.8. $W'_e$, $W'_p$ and $W'_q$ are the respective optimized versions of
$W_e$, $W_p$ and $W_q$, which convey a set of depth levels as an additional
parameter passed to $L_e$ and $L_p$. These functions compute the restrictions on
variable scope that are inserted by $W'_e$ and $W'_p$. The notation
"$\exists z\,[D]$" denotes the fact that the existentially quantified first-order
variable z is restricted to appear at a depth level among the set of depth
levels D. In practice, $L_e$ and $L_p$ can be merged into the translation
functions and implemented in a single pass over the XPath expression. Thus the
translation and the depth level computation remain linear in the size of the
query.

MONA provides an implementation of GTA. The application of the previous
algorithm to $e_3$ leads to the logical formulation shown in Figure 3.9 in MONA
syntax.

The guide obtained in this translation means that the root is labeled with
"l0"; its left and right successor nodes are labeled with "l1" and "epsilon"
respectively. The "epsilon" is a dummy state space reflecting the fact that the
underlying shape is a tree and not a hedge. No variable is associated with this
state space. The "lothers" state space represents any tree node occurring at
a depth level greater than 3. Such a state space is associated with variables
whose scope is of unbounded depth. The size of the guide depends on the
maximum depth level found among the computed restrictions.

$L_e : \mathcal{L}_{XPath} \to \mathit{Set}(\mathit{Int}) \to \mathit{Set}(\mathit{Int})$
$L_e\llbracket /p \rrbracket_N = L_p\llbracket p \rrbracket_{\{0\}}$
$L_e\llbracket e_1 \mid e_2 \rrbracket_N = L_e\llbracket e_1 \rrbracket_N \cup L_e\llbracket e_2 \rrbracket_N$
$L_e\llbracket e_1 \cap e_2 \rrbracket_N = L_e\llbracket e_1 \rrbracket_N \cap L_e\llbracket e_2 \rrbracket_N$

Figure 3.7: Computation of the Depth Levels of Nodes Selected by a Path.

Formally, a guide for a maximum depth level n is a top-down deterministic tree
automaton with $\{q_0, \ldots, q_{n+1}\} \cup \{q_\epsilon\}$ as set of states, $q_0$ as the
single initial state, and the following set of transitions:

$\{q_0 \to (q_1, q_\epsilon)\}$
$\cup\ \{q_i \to (q_{i+1}, q_i) \mid i \in [1 \ldots n]\}$
$\cup\ \{q_{n+1} \to (q_{n+1}, q_{n+1})\}$
$\cup\ \{q_\epsilon \to (q_\epsilon, q_\epsilon)\}$

where $q_i$ ($i \in [0 \ldots n]$) denotes the state space name corresponding to the
depth level i, and $q_{n+1}$ represents all depth levels greater than or equal to
n + 1. For formulating the XPath containment, the guide is computed from the two
XPath expressions. Specifically, the deepest (and thus the most precise) guide
is chosen as the guide for both expressions.
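The transition set above is regular enough to be generated mechanically. The
following Python sketch builds it for a given maximum depth level n and prints
it as a guide declaration; the function names guide_transitions and to_mona are
assumptions of the example, while the state space names "l0", "l1", ...,
"lothers" and "epsilon" follow those used in this section.

```python
def guide_transitions(n):
    """Return the guide for maximum depth level n as (state, (left, right)) pairs."""
    def name(i):
        # q_i for i <= n, and q_{n+1} for every deeper level
        return f"l{i}" if i <= n else "lothers"

    rules = [(name(0), (name(1), "epsilon"))]
    # In the binary encoding, the left successor is one level deeper (first
    # child) and the right successor stays at the same level (next sibling).
    rules += [(name(i), (name(i + 1), name(i))) for i in range(1, n + 1)]
    rules.append(("lothers", ("lothers", "lothers")))
    rules.append(("epsilon", ("epsilon", "epsilon")))
    return rules

def to_mona(rules):
    """Render the transitions as a guide declaration in MONA-like syntax."""
    body = ",\n      ".join(f"{s} -> ({left}, {right})" for s, (left, right) in rules)
    return f"guide {body};"

print(to_mona(guide_transitions(3)))
```

With n = 3 the sketch produces six transitions, starting with
l0 -> (l1, epsilon) and ending with the two self-loops on "lothers" and
"epsilon".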

Finally, each variable is restricted with a list of state spaces that represents
the regions in the tree where its valuation must be searched. For instance,
"ex1 [l1] x2" means that the scope of the variable x2 is limited to tree nodes
occurring at depth level 1.

$W'_e : \mathcal{L}_{XPath} \to \mathit{Node} \to \mathit{Node} \to \mathit{Set}(\mathit{Int}) \to \mathcal{L}_{WS2S}$
$W'_e\llbracket /p \rrbracket(x, y, N) = \exists z\,[\{0\}].\ \mathit{isroot}(z) \wedge W'_p\llbracket p \rrbracket(z, y, \{0\})$
$W'_e\llbracket p \rrbracket(x, y, N) = W'_p\llbracket p \rrbracket(x, y, N)$
$W'_e\llbracket e_1 \mid e_2 \rrbracket(x, y, N) = W'_e\llbracket e_1 \rrbracket(x, y, N) \vee W'_e\llbracket e_2 \rrbracket(x, y, N)$
$W'_e\llbracket e_1 \cap e_2 \rrbracket(x, y, N) = W'_e\llbracket e_1 \rrbracket(x, y, N) \wedge W'_e\llbracket e_2 \rrbracket(x, y, N)$

$W'_p : \mathit{Path} \to \mathit{Node} \to \mathit{Node} \to \mathit{Set}(\mathit{Int}) \to \mathcal{L}_{WS2S}$
$W'_p\llbracket p_1/p_2 \rrbracket(x, y, N) = \exists z\,[L_p\llbracket p_1 \rrbracket_N].\ W'_p\llbracket p_1 \rrbracket(x, z, N) \wedge W'_p\llbracket p_2 \rrbracket(z, y, N)$
$W'_p\llbracket p[q] \rrbracket(x, y, N) = W'_p\llbracket p \rrbracket(x, y, N) \wedge W'_q\llbracket q \rrbracket(y, N)$
$W'_p\llbracket a{::}\sigma \rrbracket(x, y, N) = a(x, y) \wedge y \in X_\sigma$

Figure 3.8: Translating XPath into WS2S with Restricted Variable Scopes.

guide l0 -> (l1, epsilon),
      l1 -> (l2, l1),
      l2 -> (l3, l2),
      l3 -> (lothers, l3),
      lothers -> (lothers, lothers),
      epsilon -> (epsilon, epsilon);

e1(x,y) = ex1 [l0] x1 : (isroot(x) & x=x1 & x in $)
  & ex1 [l1] x2 : child(x1,x2) & x2 in Xbook
  & descendant(x2,y) & y in $
  & ex1 [l3, lothers] x3 : child(y,x3)
  & x3 in Xcitation;

Figure 3.9: Optimized WS2S Translation of $e_3$ in MONA Syntax.

This optimization is useful for both kinds of XPath expressions: absolute 
and relative. More precise restrictions can be computed for absolute XPath 
expressions (for which the initial set of depth levels is the singleton {0}). 

3.5 Implementation and Experiments 

The approach has been implemented. A compiler (written in Java) takes XPath
expressions and translates them into WS2S formulas. A Java interface controls
the C++ implementation of the MONA WS2S solver, and in addition provides
precise runtime statistics on the decision procedure.

The evolution of the intermediate automata (in terms of states, number of
BDD nodes involved, minimizations, products, projections...) is reported
in real time during a run of the decision procedure. For example, Figure 3.10
shows detailed statistics on the intermediate automata built during the
comparison of the following two XPath expressions $e_4$ and $e_5$:

$e_4$ = a/b[descendant::c]/following-sibling::d/e

$e_5$ = a/d[preceding-sibling::b]/e

The horizontal axes of the charts of Figure 3.10 correspond to the number of
automata operations. In that case, 380 operations were needed to complete the
XPath containment test. Once the decision procedure terminates, the result of
the comparison is displayed in the console:

"a/b[descendant::c]/following-sibling::d/e" is contained in
"a/d[preceding-sibling::b]/e" [Total Time: 00:00:00.18]

Extensive tests have been carried out with the implementation. Tests have
been reported in [Genevès and Layaïda, 2006]. They are not detailed here,
since it is difficult to come up with a clear conclusion based on the observed
practical behavior of this decision procedure on a few instances. Instead, only
the major lessons learned from the practical experiments are summarized:

• The GTA-based optimization has been observed to be particularly useful 
as guides cause a small overhead compared to the significant performance 
gains they provide on many instances. Some containment instances can- 
not be solved without this optimization. 

• For small expressions (that are most likely to occur in practice in XSLT
transformations, as suggested by [Møller et al., 2005]), it has been
observed over many instances that the implementation can run in acceptable
time and space bounds. Since this approach is sound and complete over
a large XPath fragment, it provides an interesting alternative to the less
complex but incomplete decision procedure over a very restricted XPath
fragment previously studied in the literature [Miklau and Suciu, 2004].

• For larger XPath expressions however, intermediate tree automata
constructed can be so large that blow-ups are observed, even using GTA.
Practical experiments notably suggest that the WS2S decision procedure
implemented in MONA is particularly sensitive to the alphabet size,
which clearly makes the approach inappropriate for XPath expressions
that use a large number of tag names.

Figure 3.10: Statistics on Intermediate Automata for a Containment Check.

• The explosiveness of the approach is very difficult to control in practice. It 
is possible to find relatively small expressions for which blow-ups cannot 
be controlled, even by the GTA-based optimization. Subsequently, there 
exist relatively small XPath containment instances for which containment 
cannot be decided in acceptable time and space bounds. 

• As a result, no clear conclusion can be drawn from the experiments
concerning the maximum size and complexity of XPath expressions for
which this procedure could offer practical guarantees. Such a
characterization is made very difficult by the huge number of parameters that
must be taken into account, due to all the optimizations implemented in
MONA [Klarlund et al., 2001]. It is thus very difficult to estimate up to
which XPath expression size and complexity this decision procedure can
be used in practice. Observed results on tested instances suggest that
this approach may be efficient for XPath expressions using fewer than 10
tag names, and indicate that it cannot be reasonably used with larger
alphabets.

3.6 Outcome 

An approach based on MSO has been proposed for the XPath containment
problem: query containment is formulated in terms of a WS2S formula, which
is then decided using tree automata. An optimization method based on guided
tree automata is proposed in an attempt to take advantage of XPath
peculiarities in order to improve time and space requirements of the complex
decision procedure.

An advantage of the approach is that it provides a sound and complete 
decision procedure for a large XPath fragment. Another advantage of this 
technique is to allow generation of tree examples and counter-examples of the 
truth status of the formula. 

The major drawback of this approach, however, is that the decision proce- 
dure is based on the full construction and complementation of the intermediate 
automata. This makes the explosiveness of the approach very hard to control 
in practice and unfortunately restricts its use to only small XPath expressions. 

Surprisingly enough, the full construction and determinization of intermediate
FTA often seems unnecessary. Indeed, huge intermediate automata are
almost always reduced by subsequent projection operations. This can be
observed in most practical scenarios owing to the detailed statistics reported
by the implementation (see for instance the peaks in the evolution of
intermediate automata states in Figure 3.10). The determinization of huge
intermediate automata is the source of uncontrollable blow-ups in practice. On
many instances, it has been observed that the memory representation of
intermediate automata may require several hundred megabytes (or even several
gigabytes, which is not affordable on most current machines), even though this
appears to be unnecessary since the final resulting automaton is only several
kilobytes in size.

One direction of future work is to search for tree automata guides that
produce a finer-grained partition of the automaton state space, in order to
enhance the scalability of the decision procedure. Another perspective is to
search for approaches that do not construct unnecessary parts of intermediate
automata, or even do not construct automata at all. This is the motivation
that underlies investigations presented in the next chapter.

4.1 Introduction

Investigations presented in this chapter are motivated by a search for
automata-theoretic approaches that avoid explicit construction of tree automata.
In this direction, this chapter attempts to build efficient decision procedures
for XML problems by using the alternation-free modal μ-calculus. This logic is
as expressive as WS2S, less succinct, but has a lower complexity (exponential
time).

This chapter shows how XPath can be linearly translated into the μ-calculus.
In addition, regular tree types (including DTDs) are also linearly embedded
in the μ-calculus. XPath decision problems (containment, emptiness,
equivalence, overlap, coverage) in the presence or absence of XML types are
expressed as formulas in this logic. A state-of-the-art decision procedure for
μ-calculus satisfiability is used to solve the generated formula and to
construct relevant example and/or counter-example XML trees. The system has been
fully implemented.

Chapter Outline The chapter is organized as follows: in Section 4.2 the
μ-calculus is introduced; Section 4.3 explains how general graph models of this
logic can be restricted so that they represent XML trees. The translation of
XPath queries into this logic is described in Section 4.4. Section 4.5 embeds
regular XML types into the logic. Based on these translations, Section 4.6
explains how to formulate and solve the considered decision problems. A
complexity analysis is presented in Section 4.7, along with implementation
principles of the system. Finally, the outcome of this approach is discussed in
Section 4.8.

4.2 The μ-Calculus

The propositional μ-calculus is a propositional modal logic extended with least
and greatest fixpoint operators [Kozen, 1983]. A signature S for the μ-calculus
consists of a set Prop of atomic propositions, a set Var of propositional
variables, and a set FProg of atomic programs. In the XML context, atomic
propositions represent the symbols of the alphabet Σ used to label XML trees.
Atomic programs allow navigation in trees.

The μ-calculus with converse^1 [Vardi, 1998] augments the propositional
μ-calculus by associating with each atomic program a its converse $\bar{a}$ (such
that $\bar{\bar{a}} = a$). A program is either an atomic program or its converse.
Prog denotes the set $\mathit{FProg} \cup \{\bar{a} \mid a \in \mathit{FProg}\}$. This is the only
difference with the propositional μ-calculus, which lacks converse programs.
Equipping the logic with converse programs is useful for supporting query
languages that allow both forward and backward navigation in trees (see
Section 4.4.2). Converse programs generally provide a means to reason about the
past, which also proved to be useful in the context of program verification
[Vardi, 1998]. The interaction of converse programs with other constructs of the
logic is known to be quite subtle. In particular, converse programs are known to
interact with recursion in such a way that the finite model property is lost
[Vardi, 1998]. Satisfiability for the μ-calculus extended with converse was
proved to be decidable in EXPTIME in [Vardi, 1998], by introducing a new class
of alternating two-way automata on infinite trees.

The set $\mathcal{L}_\mu$ of formulas of the μ-calculus with converse over the signature
S is defined as follows:

$\mathcal{L}_\mu \ni \varphi, \psi ::=$        formula
      $\top$                 true
    | $p$                    atomic proposition
    | $\neg\varphi$          negation
    | $\varphi \wedge \psi$  conjunction
    | $[a]\,\varphi$         universal modality
    | $X$                    variable
    | $\mu X.\varphi$        least fixpoint

where $p \in \mathit{Prop}$, $X \in \mathit{Var}$ and a is a program. Note that X should not occur
negatively in $\mu X.\varphi$. The following abbreviations are defined:

$\bot = \neg\top$
$\langle a \rangle \varphi = \neg [a]\, \neg\varphi$
$\nu X.\varphi = \neg \mu X. \neg\varphi\{\neg X / X\}$

$\langle a \rangle \varphi$ is called the existential modality and $\nu X.\varphi$ the greatest
fixpoint. The semantics of the full μ-calculus is given with respect to a Kripke
structure $K = (W, R, L)$ where W is a set of nodes, $R : \mathit{Prog} \to 2^{W \times W}$
assigns to each atomic program a transition relation over W, and L is an
interpretation function that assigns to each atomic proposition a set of nodes.
The formal semantics function $\llbracket \cdot \rrbracket_V^K$ shown in Figure 4.1 defines the
semantics of a μ-calculus formula $\varphi$ in terms of a Kripke structure K and a
valuation V. A valuation $V : \mathit{Var} \to 2^W$ maps each variable to a subset of W.
For a valuation V, a variable X, and a set of nodes $W' \subseteq W$, $V[X/W']$ denotes
the valuation that is obtained from V by assigning W' to X.

Note that if $\varphi$ is a sentence (i.e. all propositional variables occurring in
$\varphi$ are bound), then no valuation is required. For a node $w \in W$ and a
sentence $\varphi$, the notation $K, w \models \varphi$, defined as $w \in \llbracket \varphi \rrbracket^K$,
denotes that $\varphi$ holds at w in K.

^1 The μ-calculus with converse is also known as the full μ-calculus, or
alternatively as the two-way μ-calculus in the literature.

Figure 4.1: Semantics of the μ-Calculus.



The two modalities $\langle a \rangle \varphi$ (possibility) and $[a]\,\varphi$ (necessity) are
operators for navigating the structure.
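On a finite Kripke structure, the meaning of formulas can be computed directly,
which the following Python sketch illustrates. It is a plain evaluator written
for this explanation, not the satisfiability procedure used by the system: the
tuple encoding of formulas, the program names "1" and "-1" (for a program and
its converse) and the helper names are assumptions of the example. Least
fixpoints are obtained by iterating from the empty set, which terminates on a
finite structure when the bound variable occurs only positively.

```python
def evaluate(phi, W, R, L, V=None):
    """Return the set of nodes of W at which the formula phi holds."""
    V = V or {}
    kind = phi[0]
    if kind == "true":
        return set(W)
    if kind == "prop":
        return set(L.get(phi[1], ()))
    if kind == "not":
        return set(W) - evaluate(phi[1], W, R, L, V)
    if kind == "and":
        return evaluate(phi[1], W, R, L, V) & evaluate(phi[2], W, R, L, V)
    if kind == "box":  # [a] phi: every a-successor satisfies phi
        target = evaluate(phi[2], W, R, L, V)
        edges = R.get(phi[1], ())
        return {w for w in W if all(v in target for (u, v) in edges if u == w)}
    if kind == "var":
        return set(V[phi[1]])
    if kind == "mu":   # least fixpoint by iteration from the empty set
        current = set()
        while True:
            following = evaluate(phi[2], W, R, L, {**V, phi[1]: current})
            if following == current:
                return current
            current = following
    raise ValueError(f"unknown construct: {kind}")

def diamond(a, phi):
    """Existential modality, as the abbreviation not [a] not phi."""
    return ("not", ("box", a, ("not", phi)))

def disj(phi, psi):
    """Disjunction encoded with negation and conjunction."""
    return ("not", ("and", ("not", phi), ("not", psi)))

# A three-node chain 0 -> 1 -> 2 along program "1"; "-1" is its converse.
W = {0, 1, 2}
R = {"1": {(0, 1), (1, 2)}, "-1": {(1, 0), (2, 1)}}
L = {"citation": {2}}

# mu X. citation or <1> X : a "citation" node is reachable along program "1".
reach = ("mu", "X", disj(("prop", "citation"), diamond("1", ("var", "X"))))
print(sorted(evaluate(reach, W, R, L)))
```

The converse program makes backward navigation available with the same
machinery: for instance, diamond("-1", ("true",)) holds exactly at the nodes
that have a predecessor along program "1".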

In order to avoid redundancy, only a subset of $\mathcal{L}_\mu$ composed of formulas in
negation normal form is of interest. A formula is in negation normal form if and
only if all negations in the formula appear only before atomic propositions.
Every formula is equivalent to a formula in negation normal form [Kozen, 1983],
which can be obtained by expanding negations using De Morgan's rules together
with standard dualities for modalities and fixpoints (cf. Figure 4.2).
For readability purposes, however, translations of XPath expressions given in
Section 4.4 are not given in negation normal form.

$\neg [a]\,\varphi = \langle a \rangle \neg\varphi$
$\neg \langle a \rangle \varphi = [a]\, \neg\varphi$
$\neg \mu X.\varphi = \nu X. \neg\varphi\{\neg X / X\}$
$\neg \nu X.\varphi = \mu X. \neg\varphi\{\neg X / X\}$
$\neg(\varphi_1 \wedge \varphi_2) = \neg\varphi_1 \vee \neg\varphi_2$
$\neg(\varphi_1 \vee \varphi_2) = \neg\varphi_1 \wedge \neg\varphi_2$
$\neg\neg\varphi = \varphi$

Figure 4.2: Dualities for Negation Normal Form.
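The dualities of Figure 4.2 can be applied mechanically to push negations
inward. The Python sketch below does so on a tuple encoding of formulas in which
disjunction, the existential modality and the greatest fixpoint are treated as
primitive constructs; this encoding and the function names are assumptions of
the example. Negated fixpoints are handled by first substituting the negated
variable for the bound variable, then negating the body.

```python
DUAL = {"and": "or", "or": "and", "box": "dia", "dia": "box",
        "mu": "nu", "nu": "mu", "true": "false", "false": "true"}

def negate_var(phi, x):
    """Replace the free occurrences of variable x in phi by its negation."""
    kind = phi[0]
    if kind == "var":
        return ("not", phi) if phi[1] == x else phi
    if kind in ("true", "false", "prop"):
        return phi
    if kind == "not":
        return ("not", negate_var(phi[1], x))
    if kind in ("and", "or"):
        return (kind, negate_var(phi[1], x), negate_var(phi[2], x))
    if kind in ("box", "dia"):
        return (kind, phi[1], negate_var(phi[2], x))
    # "mu" or "nu": stop when x is rebound by an inner fixpoint
    return phi if phi[1] == x else (kind, phi[1], negate_var(phi[2], x))

def nnf(phi):
    """Return an equivalent formula in negation normal form."""
    kind = phi[0]
    if kind in ("true", "false", "prop", "var"):
        return phi
    if kind in ("and", "or"):
        return (kind, nnf(phi[1]), nnf(phi[2]))
    if kind in ("box", "dia", "mu", "nu"):
        return (kind, phi[1], nnf(phi[2]))
    inner = phi[1]                       # here kind == "not"
    k = inner[0]
    if k in ("prop", "var"):
        return phi                       # negation is already at a leaf
    if k in ("true", "false"):
        return (DUAL[k],)
    if k == "not":
        return nnf(inner[1])             # double negation
    if k in ("and", "or"):               # De Morgan's rules
        return (DUAL[k], nnf(("not", inner[1])), nnf(("not", inner[2])))
    if k in ("box", "dia"):              # duality of modalities
        return (DUAL[k], inner[1], nnf(("not", inner[2])))
    body = negate_var(inner[2], inner[1])  # duality of fixpoints
    return (DUAL[k], inner[1], nnf(("not", body)))

# not (mu X. p or <1> X)  becomes  nu X. (not p) and [1] X
formula = ("not", ("mu", "X", ("or", ("prop", "p"), ("dia", "1", ("var", "X")))))
print(nnf(formula))
```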

For reasoning on XML trees, only a specific subset of $\mathcal{L}_\mu$, namely the
alternation-free modal μ-calculus with converse over finite binary trees, is of
interest.

A formula $\varphi$ in negation normal form is alternation-free whenever the
following condition holds^2: if $\mu X.\varphi_1$ (respectively $\nu X.\varphi_1$) is a
subformula of $\varphi$ and $\nu Y.\varphi_2$ (respectively $\mu Y.\varphi_2$) is a subformula
of $\varphi_1$, then X does not occur freely in $\varphi_2$.

The following section now introduces the additional restrictions of $\mathcal{L}_\mu$
related to finite binary trees.

4.3 Kripke Structures and XML Trees 

In this section, the satisfiability problem of $\mathcal{L}_\mu$ over Kripke structures
is restricted to the satisfiability problem over finite binary trees.

The propositional μ-calculus has the finite tree model property: a formula
that is satisfiable is also satisfiable on a finite tree [Kozen, 1988].
Unfortunately, the introduction of converse programs causes the loss of the
finite model property [Vardi, 1998]. Therefore, the finite model property must
be enforced along with some other properties to ensure finite binary models that
encode XML structures.

First, each XML node has at most one S-label, i.e. p A p' never holds for 
distinct atomic propositions p and p' . This can be easily incorporated in a 
//-calculus satisfiability solver. 

Second, for navigating binary trees, only two atomic programs 1 and 2
are used, together with their associated relations $R(1) = \prec_{fc}$ and $R(2) = \prec_{ns}$, whose
meaning is to respectively connect a node to its left child and to its right child.
For any $(x, y) \in W \times W$, $x \prec_{fc} y$ holds iff $y$ is the left child of $x$ (i.e. the
first child in the unranked tree representation) and $x \prec_{ns} y$ holds iff $y$ is the
right child of $x$ in the binary tree representation (i.e. the next sibling in the
unranked tree representation).
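The correspondence between unranked trees and their binary representation can be illustrated as follows. This is a small sketch of the usual first-child/next-sibling encoding; the data representation (pairs and triples) and the function names are assumptions made for the example.

```python
# An unranked tree is (label, [children]); a hedge is a list of such trees.
# A binary tree is None (empty) or (label, left, right), where `left` encodes
# the first child (relation fc) and `right` the next sibling (relation ns).

def encode(hedge):
    """Encode a list of sibling trees as a single binary tree."""
    if not hedge:
        return None
    (label, children), rest = hedge[0], hedge[1:]
    return (label, encode(children), encode(rest))


def decode(binary):
    """Inverse of encode: rebuild the list of sibling trees."""
    if binary is None:
        return []
    label, left, right = binary
    return [(label, decode(left))] + decode(right)


# The tree a[b, c[d]]: b is the first child of a, c is the next sibling of b.
doc = [("a", [("b", []), ("c", [("d", [])])])]
binary_doc = encode(doc)
```

Here `binary_doc` places `b` as the left child of `a`, and `c` as the right child of `b`, mirroring the roles of the programs 1 and 2.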

For each atomic program $a \in \{1, 2\}$, $R(\bar{a})$ is defined to be the relational
inverse of $R(a)$, i.e., $R(\bar{a}) = \{(v, u) : (u, v) \in R(a)\}$. Thus programs
$a \in \{1, 2, \bar{1}, \bar{2}\}$ are considered inside modalities for navigating downward and
upward in trees.

Restrictions for a Kripke structure to form a finite binary tree are now
defined. A Kripke structure $T = (W, R, L)$ is a finite binary tree if it satisfies
the following conditions:

(1) $W$ is finite

(2) the set of nodes $W$ together with the accessibility relation $\prec_{fc} \cup \prec_{ns}$
define a tree

(3) $\prec_{fc}$ and $\prec_{ns}$ are partial functions, i.e. for all $m \in W$ and $j \in \{1, 2\}$ there
is at most one $m_j \in W$ such that $(m, m_j) \in R(j)$.

A finite binary tree $T = (W, R, L)$ satisfies $\varphi$ if $T, r \models \varphi$ where $r \in W$ is
the root of the tree $T$.

The previous restrictions are now expressed in $\mathcal{L}_\mu^{\text{full}}$. For accessing the root,
the $\mathcal{L}_\mu^{\text{full}}$ formula

$$\varphi_{\text{root}} = [\bar{1}]\,\bot \wedge [\bar{2}]\,\bot \wedge \neg\langle 2\rangle\top$$

is used. Its meaning is to select a node provided it has no parent and no sibling.

²For instance, $\nu X.(\mu Y.\langle 1\rangle Y \wedge p) \vee \langle 2\rangle X$ is alternation-free but $\nu X.(\mu Y.\langle 1\rangle Y \wedge X) \vee p$ is
not since $X$ bound by $\nu$ appears freely in the scope of $\mu Y$.

The property for ensuring finiteness relies on König's lemma, which states
that a finitely branching infinite tree has some infinite path or, in other words, a
finitely branching tree in which every branch is finite is finite. The expression
$\nu X.\langle 1\rangle X \vee \langle 2\rangle X$ is only satisfied by structures containing infinite or cyclic
paths. To prevent the existence of such paths, the previous formula is negated
and, propagating negation using the rules presented in Figure 4.2, yields the
following formula:

$$\varphi_{\text{ft}} = \mu X.[1]X \wedge [2]X$$

$\varphi_{\text{ft}}$ states that all descending branches are finite from the current context node
($\varphi_{\text{ft}}$ is vacuously satisfied at the leaves). $\varphi_{\text{ft}}$ must hold at the root (i.e. $\varphi_{\text{root}} \wedge
\varphi_{\text{ft}}$ must hold), in order to ensure structure finiteness. This is for condition (1)
to be satisfied.
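The effect of the least fixpoint $\mu X.[1]X \wedge [2]X$ can be observed by evaluating it on small finite Kripke structures. The sketch below computes the fixpoint by naive iteration from the empty set; the representation of the relations as dictionaries of successor sets is an assumption made for illustration.

```python
def finite_branches(nodes, succ1, succ2):
    """Nodes satisfying mu X. [1]X /\\ [2]X, computed by fixpoint iteration.

    succ1 and succ2 map a node to the set of its successors through the
    programs 1 and 2 (a missing key means no successor).
    """
    current = set()
    while True:
        # A node enters the set once all its 1- and 2-successors are in it.
        following = {
            w for w in nodes
            if succ1.get(w, set()) <= current and succ2.get(w, set()) <= current
        }
        if following == current:
            return current
        current = following
```

On a structure shaped like a finite tree the iteration first collects the leaves and eventually all nodes, whereas no node lying on a cycle (or leading to one) is ever collected.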

Properties (2) and (3) still need to be enforced. This is done by rewriting
existential modalities in such a way that if a successor is supposed to exist, then
there exists at least one, and if there are many, all verify the same property.
This is a way to overcome the difficulty that in the μ-calculus, one cannot naturally
express a property like "a node has exactly n successors". Technically, $\varphi^{\text{FBT}}$
denotes the formula $\varphi$ where all occurrences of $\langle a\rangle\psi$ are replaced by $\langle a\rangle\top \wedge
[a]\,\psi^{\text{FBT}}$. Furthermore, a node cannot be both a left child and a right child:
the formula $(\neg\langle\bar{1}\rangle\top \vee \neg\langle\bar{2}\rangle\top)$ must be satisfied at each node.
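The rewriting of existential modalities described above is a simple recursive transformation. The following sketch assumes a hypothetical tuple encoding of formulas in negation normal form; the function name `fbt` and the tags are illustrative only.

```python
# Assumed encoding (negation normal form): ("prop", p), ("not", ("prop", p)),
# ("var", X), ("true",), ("false",), ("and", f, g), ("or", f, g),
# ("dia", a, f), ("box", a, f), ("mu", X, f), ("nu", X, f).

def fbt(f):
    """Replace each <a>psi by <a>true /\\ [a]fbt(psi), recursively."""
    tag = f[0]
    if tag == "dia":
        a, psi = f[1], f[2]
        return ("and", ("dia", a, ("true",)), ("box", a, fbt(psi)))
    if tag == "box":
        return ("box", f[1], fbt(f[2]))
    if tag in ("and", "or"):
        return (tag, fbt(f[1]), fbt(f[2]))
    if tag in ("mu", "nu"):
        return (tag, f[1], fbt(f[2]))
    return f  # atoms, negated atoms and variables are unchanged
```

The intent is that, once every required successor is forced to exist and all successors through the same program must agree, successors through a given program become interchangeable.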

Theorem 4.3.1 ([Tanabe et al., 2005]) A formula $\varphi$ is satisfied by a
finite binary tree model if and only if the formula
$\varphi_{\text{root}} \wedge \mu X.(\neg\langle\bar{1}\rangle\top \vee \neg\langle\bar{2}\rangle\top) \wedge [1]X \wedge [2]X \wedge \varphi^{\text{FBT}}$
is satisfied by a Kripke structure.

The proof of the "if" part iteratively constructs a tree model and proceeds
by induction on the structure of $\varphi$. The "only if" part is almost immediate.
Theorem 4.3.1 gives the adequate framework for formulating decision problems
on XML structures in terms of a μ-calculus formula.

4.4 XPath Embedding 

This section explains how an XPath expression can be translated into an
equivalent formula in $\mathcal{L}_\mu^{\text{full}}$. Navigation as performed by XPath in unranked trees is
translated in terms of navigation in the binary tree representation (using the
isomorphism presented in Section 2.1.1). The translation adheres to XPath
formal semantics in the sense that the translated formula holds for nodes which
are selected by the XPath query.

4.4.1 Logical Interpretation of Axes 

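The translations of navigational primitives (namely XPath axes) are
formally specified in Figure 4.3. The translation function, noted $A^{\rightarrow}[\![a]\!]_\chi$,
takes an XPath axis $a$ as input, and returns its translation in terms of a
formula $\chi$ given as a parameter. This parameter represents a context and
makes it possible to compose formulas, which is needed for translating path composition.
$A^{\rightarrow}[\![a]\!]_\chi$ holds for all nodes that can be accessed through the axis $a$ from some
node verifying $\chi$.

$$
\begin{aligned}
A^{\rightarrow}[\![\cdot]\!]_{\cdot} &: \text{Axis} \rightarrow \mathcal{L}_\mu^{\text{full}} \rightarrow \mathcal{L}_\mu^{\text{full}} \\
A^{\rightarrow}[\![\text{self}]\!]_\chi &\stackrel{\text{def}}{=} \chi \\
A^{\rightarrow}[\![\text{child}]\!]_\chi &\stackrel{\text{def}}{=} \mu Z.\langle\bar{1}\rangle\chi \vee \langle\bar{2}\rangle Z \\
A^{\rightarrow}[\![\text{following-sibling}]\!]_\chi &\stackrel{\text{def}}{=} \mu Z.\langle\bar{2}\rangle\chi \vee \langle\bar{2}\rangle Z \\
A^{\rightarrow}[\![\text{preceding-sibling}]\!]_\chi &\stackrel{\text{def}}{=} \mu Z.\langle 2\rangle\chi \vee \langle 2\rangle Z \\
A^{\rightarrow}[\![\text{parent}]\!]_\chi &\stackrel{\text{def}}{=} \langle 1\rangle\mu Z.\chi \vee \langle 2\rangle Z \\
A^{\rightarrow}[\![\text{descendant}]\!]_\chi &\stackrel{\text{def}}{=} \mu Z.\langle\bar{1}\rangle(\chi \vee Z) \vee \langle\bar{2}\rangle Z \\
A^{\rightarrow}[\![\text{descendant-or-self}]\!]_\chi &\stackrel{\text{def}}{=} \mu Z.\chi \vee \mu Y.\langle\bar{1}\rangle(Y \vee Z) \vee \langle\bar{2}\rangle Y \\
A^{\rightarrow}[\![\text{ancestor}]\!]_\chi &\stackrel{\text{def}}{=} \langle 1\rangle\mu Z.\chi \vee \langle 1\rangle Z \vee \langle 2\rangle Z \\
A^{\rightarrow}[\![\text{ancestor-or-self}]\!]_\chi &\stackrel{\text{def}}{=} \mu Z.\chi \vee \langle 1\rangle\mu Y.Z \vee \langle 2\rangle Y \\
A^{\rightarrow}[\![\text{following}]\!]_\chi &\stackrel{\text{def}}{=} A^{\rightarrow}[\![\text{descendant-or-self}]\!]_{\eta_1(\chi)} \\
A^{\rightarrow}[\![\text{preceding}]\!]_\chi &\stackrel{\text{def}}{=} A^{\rightarrow}[\![\text{descendant-or-self}]\!]_{\eta_2(\chi)} \\
\eta_1(\chi) &\stackrel{\text{def}}{=} A^{\rightarrow}[\![\text{following-sibling}]\!]_{A^{\rightarrow}[\![\text{ancestor-or-self}]\!]_\chi} \\
\eta_2(\chi) &\stackrel{\text{def}}{=} A^{\rightarrow}[\![\text{preceding-sibling}]\!]_{A^{\rightarrow}[\![\text{ancestor-or-self}]\!]_\chi}
\end{aligned}
$$

Figure 4.3: Translation of XPath Axes.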



For instance, the translated formula $A^{\rightarrow}[\![\text{child}]\!]_\chi$ is satisfied by children of
the context $\chi$. These nodes are composed of the first child and the remaining
children. From the first child, the context must be reached immediately by
going once upward via $\bar{1}$. From the remaining children, the context is reached
by going upward (any number of times) via $\bar{2}$ and then finally once via $\bar{1}$.
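The reading of the child axis as a least fixpoint over converse programs can be checked on a concrete binary tree. The sketch below evaluates $\mu Z.\langle\bar{1}\rangle\chi \vee \langle\bar{2}\rangle Z$ by iteration, assuming that the tree is given by two dictionaries (first child and next sibling) and that $\chi$ is given extensionally as a set of nodes; these representation choices are illustrative only.

```python
def child_axis(nodes, first_child, next_sibling, context):
    """Nodes satisfying mu Z. <1bar>chi \\/ <2bar>Z, where chi holds exactly
    on the nodes of `context`."""
    # Converse programs: node -> the node it is the first child of,
    # and node -> its previous sibling.
    up1 = {child: parent for parent, child in first_child.items()}
    up2 = {sibling: previous for previous, sibling in next_sibling.items()}
    z = set()
    while True:
        following = {
            w for w in nodes
            if up1.get(w) in context or up2.get(w) in z
        }
        if following == z:
            return z
        z = following


# Unranked tree r[a[d], b, c] in first-child / next-sibling form.
nodes = {"r", "a", "b", "c", "d"}
first_child = {"r": "a", "a": "d"}
next_sibling = {"a": "b", "b": "c"}
```

Starting from the context `{"r"}`, the iteration first finds `a` (reached via $\bar{1}$), then `b` and `c` (reached via $\bar{2}$ from a node already found), which are the children of `r` in the unranked tree.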



4.4.2 Logical Interpretation of Expressions 

Figure 4.4 gives the translation of XPath expressions into $\mathcal{L}_\mu^{\text{full}}$. The translation
function $E^{\rightarrow}[\![e]\!]_\chi$ takes an XPath expression $e$ and a formula $\chi$ (denoting
a particular context) as input, and returns the corresponding $\mathcal{L}_\mu^{\text{full}}$ translation.
The translation of relative XPath expressions uses the current context $\chi$. The
translation of absolute expressions navigates from $\chi$ to the root, which is taken
as initial context for the expression.

For example, Figure 4.5 illustrates the translation of the XPath expression
"child::a[child::b]". This expression selects all "a" child nodes of a given
context which have at least one "b" child. The translated $\mathcal{L}_\mu^{\text{full}}$ formula holds for
"a" nodes which are selected by the expression. The first part of the
translated formula, $\varphi$, corresponds to the step "child::a" which selects candidate
"a" nodes. The second part, $\psi$, navigates downward in the subtrees of these
candidate nodes to verify that they have at least one "b" child.

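Note that without converse programs it would have been impossible to
differentiate selected nodes from nodes whose existence is tested, since properties
must be stated on both the ancestors and the descendants of the selected node.
Equipping the $\mathcal{L}_\mu^{\text{full}}$ logic with both forward and converse programs is therefore
crucial for supporting XPath³. Logics without converse programs may only be
used for solving XPath emptiness but cannot be used for solving other decision
problems such as containment efficiently.

$$
\begin{aligned}
E^{\rightarrow}[\![/p]\!]_\chi &\stackrel{\text{def}}{=} P^{\rightarrow}[\![p]\!]_{((\mu Z.\neg\langle\bar{1}\rangle\top \vee \langle\bar{2}\rangle Z) \wedge (\mu Y.\chi \vee \langle 1\rangle Y \vee \langle 2\rangle Y))} \\[4pt]
P^{\rightarrow}[\![\cdot]\!]_{\cdot} &: \text{Path} \rightarrow \mathcal{L}_\mu^{\text{full}} \rightarrow \mathcal{L}_\mu^{\text{full}} \\
P^{\rightarrow}[\![a{::}\sigma]\!]_\chi &\stackrel{\text{def}}{=} \sigma \wedge A^{\rightarrow}[\![a]\!]_\chi \\
P^{\rightarrow}[\![a{::}*]\!]_\chi &\stackrel{\text{def}}{=} A^{\rightarrow}[\![a]\!]_\chi
\end{aligned}
$$

Figure 4.4: Translation of Expressions and Paths.

Translated query: child::a[child::b]

$$\underbrace{a \wedge (\mu Z.\langle\bar{1}\rangle\chi \vee \langle\bar{2}\rangle Z)}_{\varphi} \;\wedge\; \underbrace{\langle 1\rangle\mu Y.b \vee \langle 2\rangle Y}_{\psi}$$

Figure 4.5: XPath Translation Example.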

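XPath's most essential construct $p_1/p_2$ translates into formula composition
in $\mathcal{L}_\mu^{\text{full}}$, such that the resulting formula holds for all nodes accessed through
$p_2$ from those nodes accessed from $\chi$ by $p_1$. The translation of the branching
construct $p[q]$ significantly differs. The resulting formula must hold for all
nodes that can be accessed through $p$ and from which $q$ holds. To preserve
semantics, the translation of $p[q]$ stops the "selecting navigation" at those nodes
reached by $p$, then filters them depending on whether $q$ holds or not. This is
expressed by introducing a dual formal translation function for XPath qualifiers,
noted $Q^{\leftarrow}[\![q]\!]_\chi$ and defined in Figure 4.6, that performs "filtering" instead
of navigation. Specifically, $P^{\rightarrow}[\![\cdot]\!]_{\cdot}$ can be seen as the "navigational" translating
function: the translated formula holds for target nodes of the given path.
On the opposite, $Q^{\leftarrow}[\![\cdot]\!]_{\cdot}$ can be seen as the "filtering" translating function: it
states the existence of a path without moving to its end. The translated formula
$Q^{\leftarrow}[\![q]\!]_\chi$ (respectively $P^{\leftarrow}[\![p]\!]_\chi$) holds for nodes from which there exists a
qualifier $q$ (respectively a path $p$) leading to a node verifying $\chi$.

$$
\begin{aligned}
Q^{\leftarrow}[\![q_1 \text{ and } q_2]\!]_\chi &\stackrel{\text{def}}{=} Q^{\leftarrow}[\![q_1]\!]_\chi \wedge Q^{\leftarrow}[\![q_2]\!]_\chi \\
Q^{\leftarrow}[\![q_1 \text{ or } q_2]\!]_\chi &\stackrel{\text{def}}{=} Q^{\leftarrow}[\![q_1]\!]_\chi \vee Q^{\leftarrow}[\![q_2]\!]_\chi \\
Q^{\leftarrow}[\![\text{not } q]\!]_\chi &\stackrel{\text{def}}{=} \neg Q^{\leftarrow}[\![q]\!]_\chi \\
Q^{\leftarrow}[\![p]\!]_\chi &\stackrel{\text{def}}{=} P^{\leftarrow}[\![p]\!]_\chi \\[4pt]
P^{\leftarrow}[\![\cdot]\!]_{\cdot} &: \text{Path} \rightarrow \mathcal{L}_\mu^{\text{full}} \rightarrow \mathcal{L}_\mu^{\text{full}} \\
P^{\leftarrow}[\![p_1/p_2]\!]_\chi &\stackrel{\text{def}}{=} P^{\leftarrow}[\![p_1]\!]_{(P^{\leftarrow}[\![p_2]\!]_\chi)} \\
P^{\leftarrow}[\![p[q]]\!]_\chi &\stackrel{\text{def}}{=} P^{\leftarrow}[\![p]\!]_{(\chi \wedge Q^{\leftarrow}[\![q]\!]_\top)} \\
P^{\leftarrow}[\![a{::}\sigma]\!]_\chi &\stackrel{\text{def}}{=} A^{\leftarrow}[\![a]\!]_{(\chi \wedge \sigma)} \\
P^{\leftarrow}[\![a{::}*]\!]_\chi &\stackrel{\text{def}}{=} A^{\leftarrow}[\![a]\!]_\chi \\[4pt]
A^{\leftarrow}[\![\cdot]\!]_{\cdot} &: \text{Axis} \rightarrow \mathcal{L}_\mu^{\text{full}} \rightarrow \mathcal{L}_\mu^{\text{full}} \\
A^{\leftarrow}[\![a]\!]_\chi &\stackrel{\text{def}}{=} A^{\rightarrow}[\![\text{symmetric}(a)]\!]_\chi
\end{aligned}
$$

Figure 4.6: Translation of Qualifiers.

³One may ask whether it is possible to eliminate upward navigation at the XPath level,
but it is well known that such XPath rewriting techniques cause exponential blow-ups of
expression sizes [Olteanu et al., 2002].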

XPath translation is based on these two translating "modes", the first one
being used for paths and the second one for qualifiers. Whenever the "filtering"
mode is entered, it will never be left.

Translations of paths inside qualifiers are also given in Figure 4.6. They use
the specific translations for axes inside qualifiers, based on XPath symmetry:
symmetric($a$) denotes the symmetric XPath axis corresponding to the axis $a$
(for instance symmetric(child) = parent).

4.4.3 Correctness and Complexity 

The translation of XPath into $\mathcal{L}_\mu^{\text{full}}$ can be proven correct with respect to XPath
denotational semantics. First, a Wadler-like semantics of XPath expressions is
defined with respect to Kripke structures that are XML trees. Let $\mathcal{K}_t$ be the
set of Kripke structures that are finite binary trees (as defined in Section 4.3)
and $\mathcal{W}(\mathcal{K}_t) = \{w \mid w \in W,\ (W, R, L) \in \mathcal{K}_t\}$ the set of nodes of such structures.
Given a finite binary tree $T = (W, R, L) \in \mathcal{K}_t$ and some node $x \in W$ of $T$, the
functions $S_e[\![\cdot]\!]_{(T,x)}$, $S_p[\![\cdot]\!]_{(T,x)}$, $S_q[\![\cdot]\!]_{(T,x)}$ and $S_a[\![\cdot]\!]_{(T,x)}$ respectively define the
semantics of XPath expressions, paths, qualifiers, and axes:



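$$
\begin{aligned}
S_e[\![\cdot]\!]_{(\cdot,\cdot)} &: \mathcal{L}_{\text{XPath}} \rightarrow \mathcal{K}_t \rightarrow \mathcal{W}(\mathcal{K}_t) \rightarrow 2^{\mathcal{W}(\mathcal{K}_t)} \\
S_e[\![/p]\!]_{(T,x)} &\stackrel{\text{def}}{=} S_p[\![p]\!]_{(T,\text{root}(T))} \\
S_e[\![p]\!]_{(T,x)} &\stackrel{\text{def}}{=} S_p[\![p]\!]_{(T,x)} \\
S_e[\![e_1 \mid e_2]\!]_{(T,x)} &\stackrel{\text{def}}{=} S_e[\![e_1]\!]_{(T,x)} \cup S_e[\![e_2]\!]_{(T,x)} \\
S_e[\![e_1 \cap e_2]\!]_{(T,x)} &\stackrel{\text{def}}{=} S_e[\![e_1]\!]_{(T,x)} \cap S_e[\![e_2]\!]_{(T,x)} \\[4pt]
S_p[\![p_1/p_2]\!]_{(T,x)} &\stackrel{\text{def}}{=} \{z \in S_p[\![p_2]\!]_{(T,y)} \mid y \in S_p[\![p_1]\!]_{(T,x)}\} \\
S_p[\![a{::}*]\!]_{(T,x)} &\stackrel{\text{def}}{=} \{y \mid y \in S_a[\![a]\!]_{(T,x)}\} \\[4pt]
S_q[\![\cdot]\!]_{(\cdot,\cdot)} &: \text{Qualif} \rightarrow \mathcal{K}_t \rightarrow \mathcal{W}(\mathcal{K}_t) \rightarrow \{\text{true}, \text{false}\} \\
S_q[\![q_1 \text{ and } q_2]\!]_{(T,x)} &\stackrel{\text{def}}{=} S_q[\![q_1]\!]_{(T,x)} \wedge S_q[\![q_2]\!]_{(T,x)} \\
S_q[\![q_1 \text{ or } q_2]\!]_{(T,x)} &\stackrel{\text{def}}{=} S_q[\![q_1]\!]_{(T,x)} \vee S_q[\![q_2]\!]_{(T,x)} \\
S_q[\![\text{not } q]\!]_{(T,x)} &\stackrel{\text{def}}{=} \neg S_q[\![q]\!]_{(T,x)} \\[4pt]
S_a[\![\cdot]\!]_{(\cdot,\cdot)} &: \text{Axis} \rightarrow \mathcal{K}_t \rightarrow \mathcal{W}(\mathcal{K}_t) \rightarrow 2^{\mathcal{W}(\mathcal{K}_t)} \\
S_a[\![\text{self}]\!]_{(T,x)} &\stackrel{\text{def}}{=} \{x\} \\
S_a[\![\text{child}]\!]_{((W,R,L),x)} &\stackrel{\text{def}}{=} \{y \in W \mid x \prec_{fc} y\} \cup \{z \in W \mid x \prec_{fc} y \wedge y \prec_{ns}^{+} z\} \\
S_a[\![\text{following-sibling}]\!]_{((W,R,L),x)} &\stackrel{\text{def}}{=} \{z \in W \mid x \prec_{ns}^{+} z\} \\
S_a[\![\text{preceding-sibling}]\!]_{((W,R,L),x)} &\stackrel{\text{def}}{=} \{z \in W \mid z \prec_{ns}^{+} x\} \\
S_a[\![\text{parent}]\!]_{((W,R,L),x)} &\stackrel{\text{def}}{=} \{p \in W \mid p \prec_{fc} x\} \cup \{p \in W \mid p \prec_{fc} y \wedge y \prec_{ns}^{+} x\} \\
S_a[\![\text{descendant}]\!]_{(T,x)} &\stackrel{\text{def}}{=} S_a[\![\text{child}]\!]_{(T,x)} \cup \{z \in S_a[\![\text{descendant}]\!]_{(T,y)} \mid y \in S_a[\![\text{child}]\!]_{(T,x)}\} \\
S_a[\![\text{descendant-or-self}]\!]_{(T,x)} &\stackrel{\text{def}}{=} S_a[\![\text{descendant}]\!]_{(T,x)} \cup S_a[\![\text{self}]\!]_{(T,x)} \\
S_a[\![\text{ancestor}]\!]_{(T,x)} &\stackrel{\text{def}}{=} S_a[\![\text{parent}]\!]_{(T,x)} \cup \{z \in S_a[\![\text{ancestor}]\!]_{(T,y)} \mid y \in S_a[\![\text{parent}]\!]_{(T,x)}\} \\
S_a[\![\text{ancestor-or-self}]\!]_{(T,x)} &\stackrel{\text{def}}{=} S_a[\![\text{ancestor}]\!]_{(T,x)} \cup S_a[\![\text{self}]\!]_{(T,x)} \\
S_a[\![\text{following}]\!]_{(T,x)} &\stackrel{\text{def}}{=} \{z \in S_a[\![\text{descendant-or-self}]\!]_{(T,y)} \mid y \in f(T,x)\} \\
S_a[\![\text{preceding}]\!]_{(T,x)} &\stackrel{\text{def}}{=} \{z \in S_a[\![\text{descendant-or-self}]\!]_{(T,y)} \mid y \in p(T,x)\} \\
f(T,x) &\stackrel{\text{def}}{=} \{y \in S_a[\![\text{following-sibling}]\!]_{(T,w)} \mid w \in a(T,x)\} \\
p(T,x) &\stackrel{\text{def}}{=} \{y \in S_a[\![\text{preceding-sibling}]\!]_{(T,w)} \mid w \in a(T,x)\} \\
a(T,x) &\stackrel{\text{def}}{=} \{w \in S_a[\![\text{ancestor-or-self}]\!]_{(T,x)}\}
\end{aligned}
$$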

The auxiliary function root($T$) returns the root of $T$, and the relation
symbol $\prec_{ns}^{+}$ used in the semantics of axes denotes the transitive closure of the
relation $\prec_{ns}$ defined in Section 4.3.

The correctness of the translation of XPath into $\mathcal{L}_\mu^{\text{full}}$ can now be stated:

Theorem 4.4.1 (Translation Correctness) For any finite binary tree $T \in
\mathcal{K}_t$, nodes $x$ and $y$ of $T$, property $\chi \in \mathcal{L}_\mu^{\text{full}}$, expression $e \in \mathcal{L}_{\text{XPath}}$, and path
$p \in$ Path, the following equivalences hold:

(Vxg£^"" T,x^x ^ 



Proof outline: Each equivalence is proved by a straightforward structural 
induction that "peels off" the compositional layers of each set of rules. □ 



T,y^E^lel^ 

T,y^P-M^) 
T,x\=P^M.) 



yeSMiT,x) (4.1) 

ye IJ 5e[e](T,a;) 

{x I T.x\=x} 

(4.2) 

yG5pb](T,.) (4.3) 
yeSpMiT.x) (4.4)

This result links XPath decision problems to satisfiability in $\mathcal{L}_\mu^{\text{full}}$. Note that
the size of a translated formula $E^{\rightarrow}[\![e]\!]_\chi$ is linear in the length of the XPath
expression $e$ since there is no duplication of subformulas of arbitrary length in
the formal translations⁴.



4.5 Translation of Regular Tree Languages 

The translation of regular tree types into the μ-calculus is now introduced. It
is based on the binary representation of types introduced in Chapter 2. In
order to simplify translations, a notation for an n-ary least fixpoint binder is
introduced:

$$\text{let}_\mu\ (X_i.\varphi_i)_{1 \le i \le m}\ \text{in}\ \psi$$

This notation is actually syntactic sugar for $\psi$ where all free occurrences of
$X_i$ have been replaced by $\mu X_i.\varphi_i$ until $\psi$ becomes closed (that is, all $X_i$ in $\psi$
are in scope of their corresponding unary $\mu$-binder). This provides a shorthand
for denoting an $\mathcal{L}_\mu^{\text{full}}$ formula which would be of exponential size if expressed
using only the unary least fixpoint construct. Such a naive expansion contains
unnecessary duplicate formulas whereas the satisfiability solver operates only
on a single copy of them (see Section 4.7). Therefore, the n-ary binder is a
useful compact notation for representing $\mathcal{L}_\mu^{\text{full}}$ translations of recursive types,
without introducing useless blow-ups between representation of formulas and
their satisfiability test.
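The expansion of the n-ary binder into nested unary binders can be sketched as a substitution that stops as soon as a variable is already in the scope of its own binder. The tuple encoding of formulas and the name `expand` below are assumptions made for this illustration.

```python
# Assumed encoding: ("prop", p), ("var", X), ("and", f, g), ("or", f, g),
# ("dia", a, f), ("box", a, f), ("mu", X, f).

def expand(f, bindings, bound=frozenset()):
    """Expand `let_mu (X_i . phi_i) in f` into unary mu-binders.

    `bindings` maps each X_i to phi_i; `bound` records the variables that
    are already in the scope of their unary binder at this point.
    """
    tag = f[0]
    if tag == "var":
        x = f[1]
        if x in bindings and x not in bound:
            # A free occurrence of X_i becomes mu X_i . phi_i (recursively).
            return ("mu", x, expand(bindings[x], bindings, bound | {x}))
        return f
    if tag in ("and", "or"):
        return (tag, expand(f[1], bindings, bound), expand(f[2], bindings, bound))
    if tag in ("dia", "box"):
        return (tag, f[1], expand(f[2], bindings, bound))
    return f  # atomic propositions


# let_mu (X . a /\ <1>Y), (Y . b \/ <2>X) in X
bindings = {
    "X": ("and", ("prop", "a"), ("dia", "1", ("var", "Y"))),
    "Y": ("or", ("prop", "b"), ("dia", "2", ("var", "X"))),
}
expanded = expand(("var", "X"), bindings)
```

Each free occurrence of a variable triggers a fresh copy of its definition, which is the source of the duplication mentioned above when definitions refer to each other several times.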

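The translation from binary regular tree types into $\mathcal{L}_\mu^{\text{full}}$ formulas is given
by the following function $[\![\cdot]\!]$:

$$
\begin{aligned}
[\![\cdot]\!] &: \mathcal{L}_{\text{BT}} \rightarrow \mathcal{L}_\mu^{\text{full}} \\
[\![T_1 \mid T_2]\!] &\stackrel{\text{def}}{=} [\![T_1]\!] \vee [\![T_2]\!] \\
[\![l(X_1, X_2)]\!] &\stackrel{\text{def}}{=} l \wedge \text{succ}_1(X_1) \wedge \text{succ}_2(X_2) \\
[\![\text{let } \overline{X_i.T_i} \text{ in } T]\!] &\stackrel{\text{def}}{=} \text{let}_\mu\ (X_i.[\![T_i]\!])_{1 \le i \le m}\ \text{in}\ [\![T]\!]
\end{aligned}
$$

where there is an implicit bijective correspondence between $\mathcal{L}_{\text{BT}}$ variables from
TVar and $\mathcal{L}_\mu^{\text{full}}$ variables from Var. Note that the translations of the empty
tree type and the empty tree are the same since empty trees should not be
explicitly mentioned in satisfiability results. The function $\text{succ}_{\cdot}(\cdot)$ sets the tree
frontier accordingly:

$$
\begin{aligned}
\text{succ}_{\cdot}(\cdot) &: \text{Prog} \times \text{TVar} \rightarrow \mathcal{L}_\mu^{\text{full}} \\
\text{succ}_a(X) &\stackrel{\text{def}}{=}
\begin{cases}
\neg\langle a\rangle\top \vee \langle a\rangle X & \text{if nullable}(X) \\
\langle a\rangle X & \text{if not nullable}(X)
\end{cases}
\end{aligned}
$$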



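⁴Formulas in which the formal parameter $\chi$ appears twice (see Figure 4.4 and Figure 4.6)
do not cause such duplication since at this stage $\chi$ carries a constant. Section 4.6 explains
how $\chi$ is initialized with a constant at the expression level.

The predicate nullable($\cdot$) indicates if a type contains the empty tree:

$$
\begin{aligned}
\text{nullable}(\cdot) &: \text{TVar} \cup \mathcal{L}_{\text{BT}} \rightarrow \{\text{true}, \text{false}\} \\
\text{nullable}(X) &\stackrel{\text{def}}{=} \text{nullable}(\theta(X)) \quad \text{where } \theta(X) \text{ denotes the type bound to } X \\
\text{nullable}(\emptyset) &\stackrel{\text{def}}{=} \text{false} \\
\text{nullable}(\epsilon) &\stackrel{\text{def}}{=} \text{true} \\
\text{nullable}(l) &\stackrel{\text{def}}{=} \text{false} \\
\text{nullable}(T_1 \mid T_2) &\stackrel{\text{def}}{=} \text{nullable}(T_1) \vee \text{nullable}(T_2) \\
\text{nullable}(l(X_1, X_2)) &\stackrel{\text{def}}{=} \text{false} \\
\text{nullable}(\text{let } \overline{X_i.T_i} \text{ in } T) &\stackrel{\text{def}}{=} \text{nullable}(T)
\end{aligned}
$$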
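The nullable predicate is a direct structural recursion on binary tree types. The sketch below assumes a hypothetical tuple encoding of types and an explicit environment `env` mapping type variables to the types they are bound to; it also assumes, as in the binary type syntax, that variables only occur as arguments of a node constructor, so that the recursion terminates.

```python
# Assumed encoding of binary tree types:
#   ("empty",)                 the empty type
#   ("eps",)                   the empty tree
#   ("or", T1, T2)             union
#   ("node", label, X1, X2)    a node whose first child has type X1 and whose
#                              next sibling has type X2 (X1, X2 are variables)
#   ("let", {X: T, ...}, T)    recursive bindings
#   ("var", X)                 a type variable

def nullable(t, env):
    """True iff the type t contains the empty tree."""
    tag = t[0]
    if tag == "var":
        return nullable(env[t[1]], env)   # look up the type bound to X
    if tag == "empty":
        return False
    if tag == "eps":
        return True
    if tag == "or":
        return nullable(t[1], env) or nullable(t[2], env)
    if tag == "node":
        return False
    if tag == "let":
        return nullable(t[2], {**env, **t[1]})
    raise ValueError(f"unknown type constructor: {tag}")


# X describes a possibly empty sequence of childless "a" nodes.
env = {
    "X": ("or", ("eps",), ("node", "a", "Y", "X")),
    "Y": ("eps",),
}
```

With these bindings, `X` is nullable (a sequence may be empty) while a type consisting of a single node constructor is not.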

4.6 Solving XML Decision Problems 

Both XPath over unranked trees, and regular unranked tree types have been
translated in the unifying $\mathcal{L}_\mu^{\text{full}}$ logic over binary trees. Owing to these
translations, XML decision problems (such as XPath containment, equivalence,
emptiness, overlap and coverage) in the presence or absence of XML types are now
reduced to satisfiability in $\mathcal{L}_\mu^{\text{full}}$.

Correlating Context Nodes for Path Comparison In order to correlate
two different paths when performing any kind of mutual-relationship checking,
a special atomic proposition ⓢ is introduced. This atomic proposition marks
the initial context node(s) from which an XPath expression is applied. ⓢ is
used as initial value of the $\chi$ parameter of the translating function $E^{\rightarrow}[\![\cdot]\!]_\chi$. For
an XPath expression $e \in \mathcal{L}_{\text{XPath}}$, $E^{\rightarrow}[\![e]\!]_{ⓢ}$ is thus a sentence, which is denoted
by $\varphi_e$ in the remainder. Owing to the introduction of ⓢ, formulas may refer
to the same context multiple times. This makes it possible to compare different XPath
expressions applied to the same initial context, which can be any node in any
tree.

Formulating XML Problems Some simplified notations are first
introduced: $\mathcal{T}$ denotes the set of trees: by default, $\mathcal{T} = \mathcal{T}_\Sigma$, and whenever an
optional DTD $d \in \mathcal{L}_{\text{DTD}}$ is specified $\mathcal{T} = [\![d]\!]$. Additionally, $\varphi_{\mathcal{T}}$ denotes the
$\mathcal{L}_\mu^{\text{full}}$ embedding of the tree language $\mathcal{T}$. In the absence of DTDs $\varphi_{\mathcal{T}} = \top$, and
$\varphi_{\mathcal{T}} = [\![B(d)]\!]$ in the presence of $d \in \mathcal{L}_{\text{DTD}}$.

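Several decision problems needed in applications can be expressed in terms
of $\mathcal{L}_\mu^{\text{full}}$ formulas:

• XPath containment

  - Input: $e_1, e_2 \in \mathcal{L}_{\text{XPath}}$ and optional $d \in \mathcal{L}_{\text{DTD}}$

  - Problem: Does $e_2$ always select all nodes selected by $e_1$?

  - Definition: $\forall t \in \mathcal{T}, \forall x \in t,\ S_e[\![e_1]\!]_x \subseteq S_e[\![e_2]\!]_x$

  - Tested $\mathcal{L}_\mu^{\text{full}}$ formula: $\varphi_{e_1} \wedge \neg\varphi_{e_2}$

• XPath equivalence

  - Input: $e_1, e_2 \in \mathcal{L}_{\text{XPath}}$ and optional $d \in \mathcal{L}_{\text{DTD}}$

  - Problem: Does $e_2$ always select exactly the same nodes as $e_1$?

  - Definition: $\forall t \in \mathcal{T}, \forall x \in t,\ S_e[\![e_1]\!]_x = S_e[\![e_2]\!]_x$

  - Equivalence can be tested by two successive and separate containment
    checks

• XPath emptiness

  - Input: $e \in \mathcal{L}_{\text{XPath}}$ and optional $d \in \mathcal{L}_{\text{DTD}}$

  - Problem: Will $e$ ever return a non-empty set of nodes?

  - Definition: $\forall t \in \mathcal{T}, \forall x \in t,\ S_e[\![e]\!]_x = \emptyset$

  - Tested $\mathcal{L}_\mu^{\text{full}}$ formula: $\varphi_e$

• XPath overlap

  - Input: $e_1, e_2 \in \mathcal{L}_{\text{XPath}}$ and optional $d \in \mathcal{L}_{\text{DTD}}$

  - Problem: May $e_1$ and $e_2$ select common nodes?

  - Definition: $\forall t \in \mathcal{T}, \forall x \in t,\ S_e[\![e_1]\!]_x \cap S_e[\![e_2]\!]_x = \emptyset$

  - Tested $\mathcal{L}_\mu^{\text{full}}$ formula: $\varphi_{e_1} \wedge \varphi_{e_2}$

• XPath coverage

  - Input: $e_1, e_2, \ldots, e_n \in \mathcal{L}_{\text{XPath}}$ and optional $d \in \mathcal{L}_{\text{DTD}}$

  - Problem: Are nodes selected by $e_1$ always selected by one of the
    $e_2, \ldots, e_n$?

  - Definition: $\forall t \in \mathcal{T}, \forall x \in t,\ S_e[\![e_1]\!]_x \subseteq \bigcup_{2 \le i \le n} S_e[\![e_i]\!]_x$

  - Tested $\mathcal{L}_\mu^{\text{full}}$ formula: $\varphi_{e_1} \wedge \bigwedge_{2 \le i \le n} \neg\varphi_{e_i}$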

Note that for the containment problem, the unsatisfiability of $\varphi_{e_1} \wedge \neg\varphi_{e_2}$ is
tested. Indeed, checking that an XPath expression $e_1$ is contained in another
expression $e_2$ consists in checking that the implication $\varphi_{e_1} \Rightarrow \varphi_{e_2}$ holds for all
trees. In other terms, there exists no tree for which the results of $e_1$ are not
included in those of $e_2$, i.e. the negated implication $\varphi_{e_1} \wedge \neg\varphi_{e_2}$ is unsatisfiable.

Since the finite binary tree model property must be enforced (as seen in
Theorem 4.3.1), decision problems are formulated from the root, and the actually
checked formula becomes:

$$\varphi_{\text{root}} \wedge \varphi_{\text{ft}} \wedge \varphi_{\mathcal{T}} \wedge (\mu X.\varphi_{\text{tested}} \vee \langle 1\rangle X \vee \langle 2\rangle X)^{\text{FBT}} \qquad (4.5)$$

where $\varphi_{\text{tested}}$ corresponds to a particular XPath decision problem from those
given above. Intuitively, the fixpoint is introduced for "plunging" XPath
navigation performed by $\varphi_{\text{tested}}$ at any location in the tree. It is for example
necessary for relative XPath expressions that involve upward navigation in the
tree.

It is important to note that formula (4.5) is always alternation-free since
both embeddings of XPath and tree types produce alternation-free formulas,
and the negation of an alternation-free sentence remains alternation-free. In
practice, negated sentences introduced by XPath embeddings are turned into
negation normal form, by applying the rules given in Figure 4.2.

4.7 Complexity Analysis and Implementation Principles

The proposed approach has been implemented. A compiler takes XPath
expressions as input, and translates them into $\mathcal{L}_\mu^{\text{full}}$ formulas. Another compiler
takes regular tree types as input (DTDs) and outputs their $\mathcal{L}_\mu^{\text{full}}$ translation.
The formula of a particular decision problem is then composed, normalized and
solved.

The μ-calculus satisfiability solver is specialized for the alternation-free
μ-calculus with converse (AFMC). It is closely inspired by the tableau methods
described in [Tanabe et al., 2005] and [Pan et al., 2006]. A detailed description of
the AFMC solver is beyond the scope of this chapter (see [Tanabe et al., 2005]
for more details on an AFMC solver; and Chapter 6 for a detailed
description of a logical solver specialized for XML). The focus here is rather given
to the AFMC solver aspects which make it possible to establish precise complexity
results for the considered XML decision problems with the μ-calculus approach.
The algorithm relies on a top-down tableau method which attempts to
construct satisfying Kripke structures by a fixpoint computation. Nodes of the
tableau are specific subsets of a set called the Lean [Pan et al., 2006]. Given
a formula $\psi \in \mathcal{L}_\mu^{\text{full}}$, the Lean is the subset of the Fischer-Ladner closure
[Fischer and Ladner, 1979] of $\psi$ composed of atomic and modal subformulas
of $\psi$ [Pan et al., 2006]. The algorithm starts from the set of all possible nodes,
and repeatedly removes inconsistent nodes until a fixpoint is reached. At the
end of the computation, if $\psi$ is present in a node of the fixpoint, then $\psi$ is
satisfiable. In this case, the fixpoint contains a satisfying model that can be
easily extracted and used as a satisfying example XML tree.

The complexity of the addressed XML decision problems can now be stated:

Proposition 4.7.1 XPath containment, equivalence, emptiness, overlap and
coverage decision problems, in the presence or absence of regular tree
constraints, can be solved in time complexity $2^{O(n \log n)}$ where $n$ is the Lean size
of the corresponding formula.

This upper-bound is derived from:

1. the linear translations of XPath and regular tree types into the μ-calculus;

2. the $2^{O(n \log n)}$ time complexity of the solver, which corresponds to the
best known complexity for deciding alternation-free μ-calculus with
converse over Kripke structures [Tanabe et al., 2005]. Note that this
complexity is smaller than the best known complexity for the whole μ-calculus
with converse [Vardi, 1998] which is $2^{O(n^4 \log n)}$ [Grädel et al., 2002].

The key observation for the linear translation of regular tree types is that only
distinct atomic and modal subformulas of the translated formula are present in
the Lean, even for an n-ary binder $\varphi = \text{let}_\mu\ (X_i.\varphi_i)_{1 \le i \le m}\ \text{in}\ \psi$. More precisely,
the Lean corresponding to the translation of $\varphi$ contains at most:

• the two eventualities $\langle a\rangle\top$ for $a = 1, 2$

• $2 \cdot m$ universalities of the form $[a]\varphi$, where $m$ is the number of binary tree type
variables in the binder and the constant factor corresponds to the downward
programs $a = 1, 2$

• the atomic propositions representing the alphabet symbols used in $\varphi$

Deriving complexity from properties of the closure of a formula was first
used by Fischer and Ladner for establishing decidability of PDL in single
exponential time [Fischer and Ladner, 1979]. Analogous observations have also been
made for the modal logic K [Pan et al., 2006], and the μ-calculus over general
Kripke structures [Tanabe et al., 2005]. These results can be seen as an
application of this technique to the case where regular tree types are combined with
XPath bidirectional queries over finite trees.

Keys for the efficiency of the method on large practical instances are as 
follows: 

1. Nodes of the tableau contain only modal formulas and exactly one atomic 
proposition (for XML), which greatly reduces the number of enumerated 
nodes for large alphabets. 

2. Negation in the μ-calculus is rather straightforward compared to
automata techniques. Indeed, handling $\mathcal{L}_\mu^{\text{full}}$ formulas in negation normal
form simply reduces to checking membership of atomic propositions
in tableau nodes. This contrasts with tree automata techniques
which require for every negation the full construction and
complementation of automata with an exponential blow-up. As pointed out in
[Baader and Tobies, 2001] and [Pan et al., 2006], tableau methods for
logics with the tree model property can be viewed as implementations of the
automata-theoretic approach which avoids an explicit automata
construction.

3. The implementation relies on representing sets of nodes and operating on
them symbolically using Binary Decision Diagrams (BDDs) [Bryant, 1986].
BDDs provide a canonical representation of boolean functions. Their
effectiveness is well known in the domain of formal verification of systems
[Edmund M. Clarke et al., 1999]. BDD variables encode truth status of
Lean formulas. The cost of BDD operations is very sensitive to
variable ordering. Finding the optimal variable ordering is known to be
NP-complete [Hojati et al., 1996]. However, several heuristics are known
to perform well in practice [Edmund M. Clarke et al., 1999]. Choosing a
good initial variable order does significantly improve performance. Pre- 
serving locality of the initial problem happens to be essential. It can be 
easily observed that the variable order determined by the breadth-first 
traversal of the initial formula (thus keeping sister subformulas in close 
proximity while ordering Lean formulas) yields better results in practice. 

There are still areas for improvement though. In particular, a large amount
of time is spent in the μ-loop detection performed by the solver for avoiding
cycles and infinite paths in the case of finite recursion [Tanabe et al., 2005].
From this perspective, transforming the μ-calculus formula at the syntactic
level (as presented in Section 4.3) and then relying on loop detection to
enforce the finite model property is overkill. The approach may be improved by
considering XML finite tree structures as models of the logic, and building an
appropriate satisfiability solver for such structures.

4.8 Outcome

An approach for solving XPath decision problems by reduction to satisfiability
of alternation-free modal μ-calculus with converse over general Kripke
structures has been proposed. XPath queries and regular tree types are linearly
translated into the AFMC. XML decision problems are expressed as formulas
in this logic, then decided using a solver for AFMC satisfiability. With respect
to MSO, this yields much more efficient (exponential time) decision procedures
for XML decision problems. Nevertheless, this approach may still be greatly
improved, since models of the logic are too general for the XML setting, and
one has to pay extra costs for restricting them appropriately. One direction
of future work consists in designing a more appropriate calculus where models
are finite trees instead of general Kripke structures. This is what is achieved
in the remainder of this dissertation.

Chapter 5



A Fixpoint Modal Logic with 

Converse for XML 



5.1 Introduction 

This chapter and the following introduce the final results of this thesis, based 
on the lessons learned from the investigations reported in previous chapters. 

The decidability of a new logic with converse for finite and ordered trees 
is proved. The logic is sufficiently expressive to support XPath bidirectional 
navigation in finite trees along with regular tree languages. The logic is
derived from the μ-calculus and inherits some of its desirable properties, while
improving the best known complexity for finite trees. These discoveries are 
naturally applied to the static analysis of XML specifications, for which they 
yield sound, complete and efficient decision procedures. The proof method is 
based on two auxiliary results. First, XML regular tree types and XPath ex- 
pressions have a linear translation to cycle-free formulas. Second, the least and 
greatest fixpoints are equivalent for finite trees, hence the logic is closed under 
negation. 

Chapter Outline This chapter presents focused trees in Section 5.2 as a 
convenient data model for XML. The logic is then introduced in Section 5.3, 
and translations of XML concepts into the logic are presented in Section 5.4. 

5.2 Focused Trees 

In this chapter, a less conventional approach is used to represent XML trees,
called focused trees. Focused trees are directly inspired by Huet's Zipper data
structure [Huet, 1997], and are closely related to pointed trees introduced in
[Podelski, 1992, Nivat and Podelski, 1993], which were extended to pointed
hedges and applied to the XML setting in [Murata, 2001]. Focused trees not
only describe a tree but also its context: its previous siblings and its parent,
recursively. Exploring such a structure has the advantage of preserving all
information, which is quite useful when considering languages such as XPath that
allow forward and backward axes of navigation.

Formally, an alphabet $\Sigma$ of labels, ranged over by $\sigma$, is assumed.

$$
\begin{array}{rcll}
t & ::= & \sigma[tl] & \text{tree} \\
tl & ::= & & \text{list of trees} \\
 & & \epsilon & \text{empty list} \\
 & \mid & t :: tl & \text{cons cell} \\
c & ::= & & \text{context} \\
 & & (tl, \text{Top}, tl) & \text{root of the tree} \\
 & \mid & (tl, c[\sigma], tl) & \text{context node} \\
f & ::= & (t, c) & \text{focused tree}
\end{array}
$$



In order to deal with XPath containment, it is necessary to represent in a
focused tree the place where the evaluation was started, using a context mark.
To do so, we consider focused trees where a single tree or a single context node
is marked, as in $\sigma^{ⓢ}[tl]$ or $(tl, c[\sigma^{ⓢ}], tl)$. When the presence of the mark is
unknown, it is written as $\sigma^{\circ}[tl]$.

$\mathcal{F}$ denotes the set of finite focused trees with a single mark. The name of
a focused tree is defined as $\text{nm}(\sigma^{\circ}[tl], c) = \sigma$. Navigation in focused trees is
now described, in binary style. Four directions can be followed: for a focused
tree $f$, $f\langle 1\rangle$ changes the focus to the children of the current tree, $f\langle 2\rangle$ changes
the focus to the next sibling of the current tree, $f\langle\bar{1}\rangle$ changes the focus to the
parent of the tree if the current tree is a leftmost sibling, and $f\langle\bar{2}\rangle$ changes the
focus to the previous sibling.



When the focused tree does not have the required shape, these operations 
are not defined. 
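Navigation in focused trees can be prototyped with a zipper-like structure. The following is a minimal sketch, assuming Python named tuples for trees and contexts; the start mark is omitted, `None` encodes both the Top context and an undefined move, and sibling moves are allowed under any context (including Top), which is a simplifying assumption of this example.

```python
from typing import NamedTuple, Optional, Tuple


class Tree(NamedTuple):
    label: str
    children: Tuple["Tree", ...] = ()


class Context(NamedTuple):
    left: Tuple[Tree, ...]                    # previous siblings, nearest first
    parent: Optional[Tuple["Context", str]]   # (context, label) of the parent, None for Top
    right: Tuple[Tree, ...]                   # following siblings


class Focus(NamedTuple):
    tree: Tree
    ctx: Context


def down(f):
    """f<1>: focus on the first child; undefined on a leaf."""
    if not f.tree.children:
        return None
    first, *rest = f.tree.children
    return Focus(first, Context((), (f.ctx, f.tree.label), tuple(rest)))


def right(f):
    """f<2>: focus on the next sibling."""
    if not f.ctx.right:
        return None
    nxt, *rest = f.ctx.right
    return Focus(nxt, Context((f.tree,) + f.ctx.left, f.ctx.parent, tuple(rest)))


def up(f):
    """f<1bar>: focus on the parent, only from a leftmost sibling."""
    if f.ctx.left or f.ctx.parent is None:
        return None
    parent_ctx, label = f.ctx.parent
    return Focus(Tree(label, (f.tree,) + f.ctx.right), parent_ctx)


def left(f):
    """f<2bar>: focus on the previous sibling."""
    if not f.ctx.left:
        return None
    prev, *rest = f.ctx.left
    return Focus(prev, Context(tuple(rest), f.ctx.parent, (f.tree,) + f.ctx.right))
```

Because each move stores what it leaves behind in the context, a move followed by its converse returns the original focused tree, so no information is lost during navigation.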

5.3 Formulas of the Logic 

The logic to which XPath expressions and XML regular tree types are going
to be translated is introduced. It is a sub-logic of the alternation-free modal
μ-calculus with converse. Next, a restriction on the considered formulas is
introduced, and an interpretation of formulas as sets of finite focused trees is
given. Then, it is shown that the logic has a single fixpoint for these models
and that it is closed under negation.

In the following definitions, $a \in \{1, 2, \bar{1}, \bar{2}\}$ are programs and atomic
propositions $\sigma$ correspond to labels from $\Sigma$. It is also assumed that $\bar{\bar{a}} = a$.

Formulas, defined in Fig. 5.1, include the truth predicate, atomic
propositions (denoting the name of the tree in focus), start propositions (denoting the
presence of the start mark), disjunction and conjunction of formulas,
formulas under an existential (denoting the existence of a subtree satisfying the
sub-formula), and least and greatest n-ary fixpoints. We chose to include an n-ary
version of the latter because regular types are often defined as a set of mutually


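Formally:

\[
\begin{aligned}
(\sigma^{\circ}[t :: tl], c)\,\langle 1\rangle &= (t, (\epsilon, c[\sigma^{\circ}], tl)) \\
(t, (tl_l, c[\sigma^{\circ}], t' :: tl_r))\,\langle 2\rangle &= (t', (t :: tl_l, c[\sigma^{\circ}], tl_r)) \\
(t, (\epsilon, c[\sigma^{\circ}], tl))\,\langle\bar 1\rangle &= (\sigma^{\circ}[t :: tl], c) \\
(t', (t :: tl_l, c[\sigma^{\circ}], tl_r))\,\langle\bar 2\rangle &= (t, (tl_l, c[\sigma^{\circ}], t' :: tl_r))
\end{aligned}
\]

\[
\begin{array}{rll}
\mathcal{L}_\mu \ni \varphi, \psi ::= & & \text{formula} \\
 & \top & \text{true} \\
 \mid & \sigma \mid \neg\sigma & \text{atomic prop (negated)} \\
 \mid & \circledS \mid \neg\circledS & \text{context (negated)} \\
 \mid & X & \text{variable} \\
 \mid & \varphi \vee \psi & \text{disjunction} \\
 \mid & \varphi \wedge \psi & \text{conjunction} \\
 \mid & \langle a\rangle\varphi \mid \neg\langle a\rangle\top & \text{existential (negated)} \\
 \mid & \mu \overline{X_i.\varphi_i} \text{ in } \psi & \text{least n-ary fixpoint} \\
 \mid & \nu \overline{X_i.\varphi_i} \text{ in } \psi & \text{greatest n-ary fixpoint}
\end{array}
\]

Figure 5.1: Logic formulas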



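\[
\begin{aligned}
[\![\top]\!]_V &= \mathcal{F} &
[\![X]\!]_V &= V(X) \\
[\![\sigma]\!]_V &= \{f \mid nm(f) = \sigma\} &
[\![\neg\sigma]\!]_V &= \{f \mid nm(f) \neq \sigma\} \\
[\![\varphi \vee \psi]\!]_V &= [\![\varphi]\!]_V \cup [\![\psi]\!]_V &
[\![\varphi \wedge \psi]\!]_V &= [\![\varphi]\!]_V \cap [\![\psi]\!]_V \\
[\![\circledS]\!]_V &= \{f \mid f = (\sigma^{\circledS}[tl], c)\} &
[\![\neg\circledS]\!]_V &= \{f \mid f = (\sigma[tl], c)\} \\
[\![\langle a\rangle\varphi]\!]_V &= \{f\langle\bar a\rangle \mid f \in [\![\varphi]\!]_V \wedge f\langle\bar a\rangle \text{ defined}\} &
[\![\neg\langle a\rangle\top]\!]_V &= \{f \mid f\langle a\rangle \text{ undefined}\}
\end{aligned}
\]
\[
\begin{aligned}
[\![\mu \overline{X_i.\varphi_i} \text{ in } \psi]\!]_V &= \text{let } T_i = \Big(\bigcap \{T_i \subseteq \mathcal{F} \mid [\![\varphi_i]\!]_{V[\overline{T_i/X_i}]} \subseteq T_i\}\Big)_i \text{ in } [\![\psi]\!]_{V[\overline{T_i/X_i}]} \\
[\![\nu \overline{X_i.\varphi_i} \text{ in } \psi]\!]_V &= \text{let } T_i = \Big(\bigcup \{T_i \subseteq \mathcal{F} \mid T_i \subseteq [\![\varphi_i]\!]_{V[\overline{T_i/X_i}]}\}\Big)_i \text{ in } [\![\psi]\!]_{V[\overline{T_i/X_i}]}
\end{aligned}
\]

Figure 5.2: Interpretation of formulas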



recursive definitions, making their translation in our logic more succinct. In
the following we write "$\mu X.\varphi$" for "$\mu X.\varphi$ in $\varphi$".

An interpretation of formulas as sets of finite focused trees with a single
start mark is now given in Figure 5.2. The interpretation of the n-ary
fixpoints first computes the smallest or largest interpretation for each
$\varphi_i$, then returns the interpretation of $\psi$ using these bindings.

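The set of valid formulas is now restricted to cycle-free formulas, i.e.
formulas that have a bound on the number of modality cycles independently of
the number of unfoldings of their fixpoints. A modality cycle is a subformula
of the form $\langle a\rangle\varphi$ where $\varphi$ contains a top-level
existential of the form $\langle\bar a\rangle\psi$. "Top-level" means under an
arbitrary number of conjunctions or disjunctions, but not under any other
construct. For instance, the formula
"$\mu X.\langle 1\rangle(\varphi \vee \langle\bar 1\rangle X)$ in $X$" is not
cycle free: for any integer $n$, there is an unfolding of the formula with $n$
modality cycles. On the other hand, the formula
"$\mu X.\langle 1\rangle(X \vee Y),\ Y.\langle\bar 1\rangle(Y \vee \top)$ in $X$"
is cycle free: there is at most one modality cycle.

\[
\frac{\varphi = \top, \sigma, \neg\sigma, \circledS, \text{ or } \neg\circledS}
     {\Delta \parallel \Gamma \vdash^R_I \varphi}
\qquad
\frac{\Delta \parallel \Gamma \vdash^R_I \varphi \quad \Delta \parallel \Gamma \vdash^R_I \psi}
     {\Delta \parallel \Gamma \vdash^R_I \varphi \vee \psi}
\]
\[
\frac{\Delta \parallel \Gamma \vdash^R_I \varphi \quad \Delta \parallel \Gamma \vdash^R_I \psi}
     {\Delta \parallel \Gamma \vdash^R_I \varphi \wedge \psi}
\qquad
\frac{}{\Delta \parallel \Gamma \vdash^R_I \neg\langle a\rangle\top}
\qquad
\frac{\Delta \parallel (\Gamma \lhd \langle a\rangle) \vdash^R_I \varphi}
     {\Delta \parallel \Gamma \vdash^R_I \langle a\rangle\varphi}
\]
\[
\frac{\forall X_j \in \overline{X_i}.\ (\Delta + \overline{X_i : \varphi_i}) \parallel (\Gamma + \overline{X_i : \_}) \vdash^{R \setminus \overline{X_i}}_{I \setminus \overline{X_i}} \varphi_j
      \qquad \Delta \parallel \Gamma \vdash^{R \setminus \overline{X_i}}_{I \cup \overline{X_i}} \psi}
     {\Delta \parallel \Gamma \vdash^R_I \mu \overline{X_i.\varphi_i} \text{ in } \psi}
\]
\[
\frac{\forall X_j \in \overline{X_i}.\ (\Delta + \overline{X_i : \varphi_i}) \parallel (\Gamma + \overline{X_i : \_}) \vdash^{R \setminus \overline{X_i}}_{I \setminus \overline{X_i}} \varphi_j
      \qquad \Delta \parallel \Gamma \vdash^{R \setminus \overline{X_i}}_{I \cup \overline{X_i}} \psi}
     {\Delta \parallel \Gamma \vdash^R_I \nu \overline{X_i.\varphi_i} \text{ in } \psi}
\]
\[
\text{NoRec}\ \frac{X \in R \quad \Gamma(X) = \langle a\rangle}{\Delta \parallel \Gamma \vdash^R_I X}
\qquad
\text{Rec}\ \frac{X \notin R \quad \Delta \parallel \Gamma \vdash^{R \cup \{X\}}_I \Delta(X)}{\Delta \parallel \Gamma \vdash^R_I X}
\qquad
\text{Ign}\ \frac{X \in I}{\Delta \parallel \Gamma \vdash^R_I X}
\]

Figure 5.3: Cycle-free formulas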

Cycle-free formulas have a very interesting property, which can now be
described. To test whether a tree satisfies a formula, one may define a
straightforward inductive relation between trees and formulas that only holds
when the root of the tree satisfies the formula, unfolding fixpoints if
necessary. Given a tree, if a formula $\varphi$ is cycle free, then every node
of the tree will be tested a finite number of times against any given
subformula of $\varphi$. The intuition behind this property, which plays a
central role in the proof of Lemma 5.3.2, is the following. If a tree node is
tested an infinite number of times against a subformula, then there must be a
cycle in the navigation in the tree, corresponding to some modalities
occurring in the subformula, between one occurrence of the test and the next
one. As trees are considered, the cycle implies there is a modality cycle in
the formula (as cycles of the form
$\langle 1\rangle\langle 2\rangle\langle\bar 1\rangle\langle\bar 2\rangle$
cannot occur). Hence the number of modality cycles in any expansion of
$\varphi$ is unbounded, thus the formula is not cycle free.

Figure 5.3 gives an inductive relation that decides whether a formula is 
cycle free. 

In the judgement $\Delta \parallel \Gamma \vdash^R_I \varphi$ of Fig. 5.3,
$\Delta$ is an environment binding some recursion variables to their formulas,
$\Gamma$ binds variables to modalities, $R$ is a set of variables that have
already been expanded (see below), and $I$ is a set of variables already
checked.

The environment $\Gamma$ used to derive the judgement consists of bindings
from variables (from enclosing fixpoint operators) to modalities. A modality
may be $\_$ (no information is known about the variable), $\langle a\rangle$
(the last modality taken was $\langle a\rangle$ and it was consistent), or
$\bot$ (a cycle has been detected). A formula is not cycle free if an
occurrence of a variable under a fixpoint operator is either not under a
modality (in this case $\Gamma(X) = \_$), or is under a cycle
($\Gamma(X) = \bot$). Cycle detection uses an auxiliary operator to detect
modality cycles:



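\[ \Gamma \lhd \langle a\rangle = \{X : (\Gamma(X) \lhd \langle a\rangle)\} \]

where

\[
\begin{array}{c|cccc}
\cdot \lhd \cdot & \langle 1\rangle & \langle 2\rangle & \langle\bar 1\rangle & \langle\bar 2\rangle \\
\hline
\_ & \langle 1\rangle & \langle 2\rangle & \langle\bar 1\rangle & \langle\bar 2\rangle \\
\langle 1\rangle & \langle 1\rangle & \langle 2\rangle & \bot & \langle\bar 2\rangle \\
\langle 2\rangle & \langle 1\rangle & \langle 2\rangle & \langle\bar 1\rangle & \bot \\
\langle\bar 1\rangle & \bot & \langle 2\rangle & \langle\bar 1\rangle & \langle\bar 2\rangle \\
\langle\bar 2\rangle & \langle 1\rangle & \bot & \langle\bar 1\rangle & \langle\bar 2\rangle \\
\bot & \bot & \bot & \bot & \bot
\end{array}
\]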



To check that mutually recursive formulas are cycle free, one proceeds in the
following way. When a mutually recursive formula is encountered, for instance
$\mu \overline{X_i.\varphi_i}$ in $\psi$, every recursive binding is checked.
Because of mutual recursion, formulas cannot be checked independently and a
variable must be expanded the first time it is encountered (rule Rec).
However there is no need to expand it a second time (rule NoRec). When
checking $\psi$, as the formulas bound to the enclosing recursion have been
checked to be cycle free, there is no need to further check these variables
(rule Ign). To account for shadowing of variables, newly bound recursion
variables are removed from $I$ and $R$ when checking a recursion. One may
easily prove that if $\Delta \parallel \Gamma \vdash^R_I \varphi$ holds, then
$I \cap R = \emptyset$.

This relation decides whether a formula is cycle free because, if it is not,
there must be a recursive binding of $X_i$ to $\varphi_i$ such that
$\varphi_i$, once the recursion variables $X_j$ are replaced by their
definitions, exhibits a modality cycle above $X_i$, where the $X_j$ are
recursion variables being defined (either in the recursion defining $X_i$ or
in an enclosing recursion definition).

With these definitions, a first result can now be shown: in the finite
focused-tree interpretation, the least and greatest fixpoints coincide for
cycle-free formulas. To this end, a stronger result is proved, which states
that a given focused tree is in the interpretation of a formula if it is in a
finite unfolding of the formula. In the base case, the formula
$\sigma \wedge \neg\sigma$ is used as "false".

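Definition 5.3.1 (Finite unfolding) A finite unfolding of a formula $\varphi$
belongs to the set $unf(\varphi)$ inductively defined as

\[
\begin{aligned}
unf(\varphi) &= \{\varphi\} \quad \text{for } \varphi = \top, \sigma, \neg\sigma, \circledS, \neg\circledS, X, \neg\langle a\rangle\top \\
unf(\varphi \vee \psi) &= \{\varphi' \vee \psi' \mid \varphi' \in unf(\varphi), \psi' \in unf(\psi)\} \\
unf(\varphi \wedge \psi) &= \{\varphi' \wedge \psi' \mid \varphi' \in unf(\varphi), \psi' \in unf(\psi)\} \\
unf(\langle a\rangle\varphi) &= \{\langle a\rangle\varphi' \mid \varphi' \in unf(\varphi)\} \\
unf(\mu \overline{X_i.\varphi_i} \text{ in } \psi) &= unf(\psi\{\overline{\mu \overline{X_i.\varphi_i} \text{ in } X_i / X_i}\}) \\
unf(\nu \overline{X_i.\varphi_i} \text{ in } \psi) &= unf(\psi\{\overline{\nu \overline{X_i.\varphi_i} \text{ in } X_i / X_i}\}) \\
unf(\mu \overline{X_i.\varphi_i} \text{ in } \psi) &= \sigma \wedge \neg\sigma \\
unf(\nu \overline{X_i.\varphi_i} \text{ in } \psi) &= \sigma \wedge \neg\sigma
\end{aligned}
\]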

Lemma 5.3.2 Let $\varphi$ be a cycle-free formula. If
$f \in [\![\varphi]\!]_V$ then $f \in [\![unf(\varphi)]\!]_V$.

The reason why this lemma holds is the following. Given a tree satisfying
$\varphi$, we deduce from the hypothesis that $\varphi$ is cycle free the fact
that every node of the tree will be tested a finite number of times against
every subformula of $\varphi$. As the tree and the number of subformulas are
finite, the satisfaction derivation is finite, hence only a finite number of
unfoldings is necessary to prove that the tree satisfies the formula, which is
what the lemma states. As least and greatest fixpoints coincide when only a
finite number of unfoldings is required, this is sufficient to show that they
collapse. Note that this would not hold if infinite trees were allowed: the
formula $\mu X.\langle 1\rangle X$ is cycle free, but its interpretation is
empty, whereas the interpretation of $\nu X.\langle 1\rangle X$ includes every
tree with an infinite branch of $\langle 1\rangle$ children.

We now illustrate why formulas need to be cycle free for the fixpoints to
collapse. Consider the formula $\mu X.\langle 1\rangle\langle\bar 1\rangle X$.
Its interpretation is empty. The interpretation of
$\nu X.\langle 1\rangle\langle\bar 1\rangle X$ however contains every focused
tree that has one $\langle 1\rangle$ child.
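
The two situations can be replayed on one small finite document. The sketch
below is only an illustration under stated assumptions: the nodes of a fixed
document stand for its focused trees, the four-node document and all names
(`NODES`, `diamond`, `lfp`, `gfp`) are hypothetical, and fixpoints are obtained
by plain iteration from the empty set or from the full set of nodes.

```python
# Document: r has children x, y (in that order); x has a single child z.
NODES = {"r", "x", "y", "z"}
FIRST_CHILD = {"r": "x", "x": "z"}
NEXT_SIBLING = {"x": "y"}
MOVES = {
    1: FIRST_CHILD,                                   # <1>
    2: NEXT_SIBLING,                                  # <2>
    -1: {c: p for p, c in FIRST_CHILD.items()},       # <1bar>, leftmost only
    -2: {s: n for n, s in NEXT_SIBLING.items()},      # <2bar>
}

def diamond(a, targets):
    """Nodes n such that n<a> is defined and belongs to `targets`."""
    return {n for n in NODES if MOVES[a].get(n) in targets}

def iterate(step, start):
    """Apply a monotone `step` repeatedly until the set stops changing."""
    current = set(start)
    while True:
        following = step(current)
        if following == current:
            return current
        current = following

def lfp(step):
    return iterate(step, set())      # least fixpoint: start from the bottom

def gfp(step):
    return iterate(step, NODES)      # greatest fixpoint: start from the top

def down(x):
    return diamond(1, x)                 # body of  X = <1> X  (cycle free)

def down_up(x):
    return diamond(1, diamond(-1, x))    # body of  X = <1><1bar> X  (a cycle)
```

On this finite document both fixpoints of `down` are empty, whereas for
`down_up` the least fixpoint is empty and the greatest one contains exactly
the nodes that have a first child.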
Proof outline: 

The result is a consequence of the fact that a subformula is never tested
twice against the same node of the focused tree, as there is no cycle in the
formula. It is thus possible to annotate occurrences of $\nu$ and $\mu$ with
the direction the formula is exploring for each variable, as in Fig. 5.3, and
prove the result by induction on the size of the focused tree in this
direction.

More precisely, each variable in every $\mu$ and $\nu$ of the initial formula
is given a unique identifier.

The induction principle relies on the longest path of a focused tree. Given a 
tree and a direction (which may be _) , we define the longest path as the longest 
cycle-free path that starts in the initial direction. 

We then prove the property that a tree $f$ belongs to the finite unfolding of
$\varphi$ by induction on the lexicographic order of:

1. the number of fixpoints not yet annotated; 

2. the max of the lengths of the longest path for a given unique identifier 
according to the direction for this identifier; 

3. the size of the formula. 

The interesting case is an annotated formula recursion
$\varphi = \mu \overline{X_i.\varphi_i}$ in $\psi$. This formula may only have
been produced by an expansion. As the formula is cycle free, at least one
modality has been encountered since the expansion for each identifier
associated with the $X_i$, and these modalities are compatible with the
previous directions (if they existed). The longest path for each identifier is
thus shorter, hence we have by induction that $f$ is in a finite expansion of
the expansion of $\varphi$. □

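In the rest of the dissertation, only least fixpoints are considered. An
important consequence of Lemma 5.3.2 is that the logic restricted in this way
is closed under negation using De Morgan's dualities, extended to
eventualities and fixpoints as follows:

\[
\begin{aligned}
\neg\langle a\rangle\varphi &= \neg\langle a\rangle\top \vee \langle a\rangle\neg\varphi \\
\neg\mu \overline{X_i.\varphi_i} \text{ in } \psi &= \mu \overline{X_i.\neg\varphi_i\{\overline{X_i/\neg X_i}\}} \text{ in } \neg\psi\{\overline{X_i/\neg X_i}\}
\end{aligned}
\]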
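
The dualities above can be turned into a small negation function. The sketch
below is an illustration only: the tuple encoding of formulas (`"dia"` for
$\langle a\rangle\varphi$, `"ndia"` for $\neg\langle a\rangle\top$, and so on)
is an assumption made for this example, negative integers stand for converse
programs, only least fixpoints are handled, and `some_label` is a hypothetical
parameter naming the atomic proposition used to build "false".

```python
def negate(phi, some_label="a"):
    """Negation of a formula in negation normal form (tuple encoding)."""
    kind = phi[0]
    if kind == "top":                       # no explicit "false": use s /\ not s
        return ("and", ("prop", some_label), ("nprop", some_label))
    if kind == "prop":
        return ("nprop", phi[1])
    if kind == "nprop":
        return ("prop", phi[1])
    if kind == "start":
        return ("nstart",)
    if kind == "nstart":
        return ("start",)
    if kind == "var":                       # the substitution X / not X cancels out
        return phi
    if kind == "or":
        return ("and", negate(phi[1], some_label), negate(phi[2], some_label))
    if kind == "and":
        return ("or", negate(phi[1], some_label), negate(phi[2], some_label))
    if kind == "dia":                       # not <a>p  =  not <a>T  \/  <a> not p
        return ("or", ("ndia", phi[1]),
                ("dia", phi[1], negate(phi[2], some_label)))
    if kind == "ndia":                      # not not <a>T  =  <a>T
        return ("dia", phi[1], ("top",))
    if kind == "mu":                        # negate every binding and the body
        bindings, body = phi[1], phi[2]
        negated = tuple((x, negate(d, some_label)) for x, d in bindings)
        return ("mu", negated, negate(body, some_label))
    raise ValueError(f"unknown formula kind: {kind!r}")

# mu X. b \/ <2> X   becomes   mu X. not b /\ (not <2>T \/ <2> X)
reach_b = ("mu", (("X", ("or", ("prop", "b"), ("dia", 2, ("var", "X")))),),
           ("var", "X"))
```

Variables are returned unchanged because negating the body and substituting
$\neg X_i$ for $X_i$ cancel each other on variable occurrences.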
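5.4 Translations of XML Concepts

The interpretation of XPath expressions as sets of focused trees is given:

\[
\begin{aligned}
\mathcal{S}_e[\![/p]\!]_F &= \mathcal{S}_p[\![p]\!]_{root(F)} \\
\mathcal{S}_e[\![p]\!]_F &= \mathcal{S}_p[\![p]\!]_{\{(\sigma^{\circledS}[tl], c) \in F\}} \\
\mathcal{S}_e[\![e_1 \cap e_2]\!]_F &= \mathcal{S}_e[\![e_1]\!]_F \cap \mathcal{S}_e[\![e_2]\!]_F
\end{aligned}
\]

\[
\begin{aligned}
\mathcal{S}_p[\![\cdot]\!]_\cdot &: Path \to 2^{\mathcal{F}} \to 2^{\mathcal{F}} \\
\mathcal{S}_p[\![p[q]]\!]_F &= \{f \mid f \in \mathcal{S}_p[\![p]\!]_F \wedge \mathcal{S}_q[\![q]\!]_f\} \\
\mathcal{S}_p[\![a::\sigma]\!]_F &= \{f \mid f \in \mathcal{S}_a[\![a]\!]_F \wedge nm(f) = \sigma\} \\
\mathcal{S}_p[\![a::*]\!]_F &= \{f \mid f \in \mathcal{S}_a[\![a]\!]_F\}
\end{aligned}
\]

\[
\begin{aligned}
\mathcal{S}_q[\![\cdot]\!]_\cdot &: Qualif \to \mathcal{F} \to \{true, false\} \\
\mathcal{S}_q[\![q_1 \text{ and } q_2]\!]_f &= \mathcal{S}_q[\![q_1]\!]_f \wedge \mathcal{S}_q[\![q_2]\!]_f \\
\mathcal{S}_q[\![q_1 \text{ or } q_2]\!]_f &= \mathcal{S}_q[\![q_1]\!]_f \vee \mathcal{S}_q[\![q_2]\!]_f \\
\mathcal{S}_q[\![\text{not } q]\!]_f &= \neg\,\mathcal{S}_q[\![q]\!]_f
\end{aligned}
\]

\[
\begin{aligned}
\mathcal{S}_a[\![\cdot]\!]_\cdot &: Axis \to 2^{\mathcal{F}} \to 2^{\mathcal{F}} \\
\mathcal{S}_a[\![\text{child}]\!]_F &= fchild(F) \cup \mathcal{S}_a[\![\text{following-sibling}]\!]_{fchild(F)} \\
\mathcal{S}_a[\![\text{following-sibling}]\!]_F &= nsibling(F) \cup \mathcal{S}_a[\![\text{following-sibling}]\!]_{nsibling(F)} \\
\mathcal{S}_a[\![\text{preceding-sibling}]\!]_F &= psibling(F) \cup \mathcal{S}_a[\![\text{preceding-sibling}]\!]_{psibling(F)} \\
\mathcal{S}_a[\![\text{parent}]\!]_F &= parent(F) \\
\mathcal{S}_a[\![\text{descendant}]\!]_F &= \mathcal{S}_a[\![\text{child}]\!]_F \cup \mathcal{S}_a[\![\text{descendant}]\!]_{(\mathcal{S}_a[\![\text{child}]\!]_F)} \\
\mathcal{S}_a[\![\text{descendant-or-self}]\!]_F &= F \cup \mathcal{S}_a[\![\text{descendant}]\!]_F \\
\mathcal{S}_a[\![\text{ancestor}]\!]_F &= \mathcal{S}_a[\![\text{parent}]\!]_F \cup \mathcal{S}_a[\![\text{ancestor}]\!]_{(\mathcal{S}_a[\![\text{parent}]\!]_F)} \\
\mathcal{S}_a[\![\text{ancestor-or-self}]\!]_F &= F \cup \mathcal{S}_a[\![\text{ancestor}]\!]_F \\
\mathcal{S}_a[\![\text{following}]\!]_F &= \mathcal{S}_a[\![\text{descendant-or-self}]\!]_{(\mathcal{S}_a[\![\text{following-sibling}]\!]_{(\mathcal{S}_a[\![\text{ancestor-or-self}]\!]_F)})} \\
\mathcal{S}_a[\![\text{preceding}]\!]_F &= \mathcal{S}_a[\![\text{descendant-or-self}]\!]_{(\mathcal{S}_a[\![\text{preceding-sibling}]\!]_{(\mathcal{S}_a[\![\text{ancestor-or-self}]\!]_F)})}
\end{aligned}
\]

\[
\begin{aligned}
fchild(F) &= \{f\langle 1\rangle \mid f \in F \wedge f\langle 1\rangle \text{ defined}\} \\
nsibling(F) &= \{f\langle 2\rangle \mid f \in F \wedge f\langle 2\rangle \text{ defined}\} \\
psibling(F) &= \{f\langle\bar 2\rangle \mid f \in F \wedge f\langle\bar 2\rangle \text{ defined}\} \\
parent(F) &= \{(\sigma^{\circ}[rev\_a(tl_l, t :: tl_r)], c) \mid (t, (tl_l, c[\sigma^{\circ}], tl_r)) \in F\} \\
rev\_a(\epsilon, tl_r) &= tl_r \\
rev\_a(t :: tl_l, tl_r) &= rev\_a(tl_l, t :: tl_r) \\
root(F) &= \{(\sigma^{\circ}[tl], (tl, Top, tl)) \in F\} \cup root(parent(F))
\end{aligned}
\]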

5.4.1 XPath Embedding 

An XPath expression can be translated into an equivalent formula in
$\mathcal{L}_\mu$ which performs navigation in focused trees in binary style,
as presented in Section 4.4 of Chapter 4. A stronger result can be proved:

Proposition 5.4.1 (Translation Correctness) The following hold for an
XPath expression $e$ and a $\mathcal{L}_\mu$ formula $\varphi$, with
$\psi = E^{\rightarrow}[\![e]\!]_{\varphi}$:

2. $\psi$ is cycle-free

3. the size of $\psi$ is linear in the size of $e$ and $\varphi$

Proof outline: The proof uses a structural induction that "peels off" the 
compositional layers of each set of rules over focused trees. The cycle-free 






Translation of following-sibling::a/preceding-sibling::b into
$\mathcal{L}_\mu$:
$b \wedge [\mu Y.\langle 2\rangle(a \wedge (\mu Z.\langle\bar 2\rangle\circledS \vee \langle\bar 2\rangle Z)) \vee \langle 2\rangle Y]$

Figure 5.4: Example of Back and Forth XPath Navigation Translation.



part follows from the fact that translated fixpoint formulas are closed and
there is no nesting of modalities with converse programs between a fixpoint
variable and its binder. Each XPath navigation step is cycle-free, and their
composition yields a proper nesting of fixpoint formulas which is also
cycle-free. Figure 5.4 illustrates this on a typical example. Finally, formal
translations do not duplicate any subformula of arbitrary length. □

5.4.2 Embedding Regular Tree Languages 

The straightforward isomorphism between unranked and binary regular tree
types (presented in Section 2.1.3 of Chapter 2) is used. The translation from
binary regular tree types into $\mathcal{L}_\mu$ is given by the function
$[\![\cdot]\!]$ as follows:

\[
\begin{aligned}
[\![\emptyset]\!] &= \sigma \wedge \neg\sigma \\
[\![T_1 \mid T_2]\!] &= [\![T_1]\!] \vee [\![T_2]\!] \\
[\![\sigma(X_1, X_2)]\!] &= \sigma \wedge succ_1(X_1) \wedge succ_2(X_2) \\
[\![\text{let } \overline{X_i.T_i} \text{ in } T]\!] &= \mu \overline{X_i.[\![T_i]\!]} \text{ in } [\![T]\!]
\end{aligned}
\]

where the formula $\sigma \wedge \neg\sigma$ is used as "false", and the
function $succ_\cdot(\cdot)$ takes care of setting the type frontier:

\[
succ_a(X) =
\begin{cases}
\neg\langle a\rangle\top \vee \langle a\rangle X & \text{if } nullable(X) \\
\langle a\rangle X & \text{if not } nullable(X)
\end{cases}
\]

according to the predicate $nullable(\cdot)$ (defined in Section 4.5 of the
previous chapter), which indicates whether a type contains the empty tree.

Note that the translation of a regular tree type uses only downward
modalities since it describes the allowed subtrees at a given context. No
additional restriction is imposed on the context from which the type
definition starts. In particular, navigation is allowed in the upward
direction so that type constraints for which only partial knowledge in a given
direction is known can be supported. However, when the position of the root is
known, conditions similar to those of absolute paths are added. This is
particularly useful when a regular type is used by an XPath expression that
starts its navigation at the root ($/p$) since the path will not go above the
root of the type (by adding the restriction
$\mu Z.\neg\langle\bar 1\rangle\top \vee \langle\bar 2\rangle Z$).

On the other hand, if the type is compared with another type (typically to 
check inclusion of the result of an XPath expression in this type), then there 
is no restriction as to where the root of the type is (the translation does not 
impose the chosen node to be at the root). This is particularly useful since an 
XPath expression usually returns a set of nodes deep in the tree which may be 
compared to this partially defined type. 
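
The translation of binary regular tree types can be sketched as a short
recursive function. This is an illustration under stated assumptions, not the
implementation of the system: types and formulas are encoded as tuples, the
set `nullable` of nullable type variables is taken as an input instead of
being computed, and `false_label` is a hypothetical parameter naming the label
used to build "false".

```python
def succ(a, x, nullable):
    """Frontier-aware successor: <a>X, with an escape when X is nullable."""
    step = ("dia", a, ("var", x))
    if x in nullable:
        return ("or", ("ndia", a), step)     # not <a>T  \/  <a>X
    return step

def translate(t, nullable, false_label="a"):
    """Translate a binary regular tree type into a formula (tuple encoding)."""
    kind = t[0]
    if kind == "empty":                      # the empty type becomes "false"
        return ("and", ("prop", false_label), ("nprop", false_label))
    if kind == "alt":                        # T1 | T2
        return ("or", translate(t[1], nullable, false_label),
                translate(t[2], nullable, false_label))
    if kind == "node":                       # sigma(X1, X2)
        _, sigma, x1, x2 = t
        return ("and", ("prop", sigma),
                ("and", succ(1, x1, nullable), succ(2, x2, nullable)))
    if kind == "let":                        # let Xi.Ti in T
        _, bindings, body = t
        translated = tuple((x, translate(ti, nullable, false_label))
                           for x, ti in bindings)
        return ("mu", translated, translate(body, nullable, false_label))
    raise ValueError(f"unknown type constructor: {kind!r}")

# A node labelled a whose first-child variable X is nullable and whose
# next-sibling variable Y is not.
example = translate(("node", "a", "X", "Y"), nullable={"X"})
```

Only the downward programs 1 and 2 appear in the output, in line with the
remark that the translation of a type uses downward modalities only.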
Satisfiability-Testing Algorithm



6.1 Introduction 

This chapter presents the algorithm for deciding the logic introduced in the
previous chapter. The algorithm is shown to be sound and complete, and its
time complexity bound is proved. The combination of all these ingredients
leads to the main result: a satisfiability algorithm for a logic for finite
trees whose time complexity is a single exponential of the size of a formula.
Building on these proofs, a practically effective system for solving the
satisfiability of a formula is described. The system has been tested on
decision problems such as XPath containment, emptiness, overlap, and coverage,
with or without type constraints.

Chapter Outline Some preliminary notions are defined in Section 6.2. The
satisfiability algorithm is then introduced in Section 6.3 and proven correct
in Section 6.4, with details of the implementation discussed in Section 6.5.
Applications for type checking are described in Section 6.6 along with some
experimental results, before the outcome of the approach is discussed in
Section 6.7.

6.2 Preliminary Definitions 

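The unwinding of a formula
$\varphi = (\mu \overline{X_i.\varphi_i}$ in $\psi)$, noted $exp(\varphi)$, is
defined as
$exp(\varphi) = \psi\{\overline{\mu \overline{X_i.\varphi_i} \text{ in } X_i / X_i}\}$,
which denotes the formula $\psi$ in which every occurrence of an $X_i$ is
replaced by $(\mu \overline{X_i.\varphi_i}$ in $X_i)$.

The Fischer-Ladner closure $cl(\psi)$ of a formula $\psi$ is defined as the set
of all subformulas of $\psi$ where fixpoint formulas are additionally unwound
once. Specifically, the relation
$\to_e\ \subseteq \mathcal{L}_\mu \times \mathcal{L}_\mu$ is defined as the
least relation that satisfies the following:

• $\varphi_1 \wedge \varphi_2 \to_e \varphi_1$, $\varphi_1 \wedge \varphi_2 \to_e \varphi_2$

• $\varphi_1 \vee \varphi_2 \to_e \varphi_1$, $\varphi_1 \vee \varphi_2 \to_e \varphi_2$

• $\langle a\rangle\varphi' \to_e \varphi'$

• $\mu \overline{X_i.\varphi_i}$ in $\psi \to_e exp(\mu \overline{X_i.\varphi_i}$ in $\psi)$

The closure $cl(\psi)$ is the smallest set $S$ that contains $\psi$ and is
closed under the relation $\to_e$, i.e. if $\varphi_1 \in S$ and
$\varphi_1 \to_e \varphi_2$ then $\varphi_2 \in S$.

$\Sigma(\psi)$ denotes the set of atomic propositions used in $\psi$ along with
another name, $\sigma_x$, representing atomic propositions not occurring in
$\psi$.

The extended closure is defined as
$cl^*(\psi) = cl(\psi) \cup \{\neg\varphi \mid \varphi \in cl(\psi)\}$. Every
formula $\varphi \in cl^*(\psi)$ can be seen as a boolean combination of
formulas of a set called the Lean of $\psi$, inspired from [Pan et al., 2006].
This set is noted $Lean(\psi)$ and defined as follows:

\[
Lean(\psi) = \{\langle a\rangle\top \mid a \in \{1, 2, \bar 1, \bar 2\}\} \cup \Sigma(\psi)
\cup \{\circledS\} \cup \{\langle a\rangle\varphi \mid \langle a\rangle\varphi \in cl(\psi)\}
\]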

A $\psi$-type (or simply a "type") (Hintikka set in the temporal logic
literature) is a set $t \subseteq Lean(\psi)$ such that:

• $\forall \langle a\rangle\varphi \in Lean(\psi)$,
$\langle a\rangle\varphi \in t \Rightarrow \langle a\rangle\top \in t$ (modal
consistency);

• $\langle\bar 1\rangle\top \notin t \vee \langle\bar 2\rangle\top \notin t$ (a
tree node cannot be both a first child and a second child);

• exactly one atomic proposition $\sigma \in t$ (XML labeling); the function
$\sigma(t)$ is used to return the atomic proposition of a type $t$;

• $\circledS$ may belong to $t$.

$Typ(\psi)$ denotes the set of $\psi$-types. For a $\psi$-type $t$, the
complement of $t$ is the set $Lean(\psi) \setminus t$.
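
The conditions above can be checked mechanically. The sketch below enumerates
all types over a small lean by brute force; it is an illustration only, not
the symbolic representation an efficient solver would use. The tuple encoding,
the placeholder name `"phi"` for a fixpoint subformula, and the label
`"other"` standing for $\sigma_x$ are assumptions made for this example.

```python
from itertools import combinations

TOP = "T"
# A lean with the four <a>T formulas, three labels, the start mark, and two
# modal formulas <1>phi and <2>phi (negative numbers are converse programs).
LEAN = [
    ("dia", 1, TOP), ("dia", -1, TOP), ("dia", 2, TOP), ("dia", -2, TOP),
    ("prop", "a"), ("prop", "b"), ("prop", "other"),
    ("start",),
    ("dia", 1, "phi"), ("dia", 2, "phi"),
]

def is_type(t):
    """Check the conditions defining a psi-type on a subset t of the lean."""
    # Modal consistency: <a>phi in t implies <a>T in t.
    modal_ok = all(("dia", f[1], TOP) in t for f in t if f[0] == "dia")
    # A node cannot be both a first child and a next sibling.
    backward_ok = not (("dia", -1, TOP) in t and ("dia", -2, TOP) in t)
    # XML labeling: exactly one atomic proposition.
    one_label = sum(1 for f in t if f[0] == "prop") == 1
    return modal_ok and backward_ok and one_label

def psi_types(lean):
    """Enumerate every subset of the lean that is a type (exponential)."""
    return [
        frozenset(subset)
        for size in range(len(lean) + 1)
        for subset in combinations(lean, size)
        if is_type(frozenset(subset))
    ]

ALL_TYPES = psi_types(LEAN)
```

For this ten-element lean, 162 of the 1024 subsets are types: three choices
for each forward program, three for the backward programs, three labels, and
the optional start mark.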

A type determines a truth assignment of every formula in $cl^*(\psi)$ with the
relation $\dot\in$ defined in Figure 6.1.

Note that such derivations are finite because the number of naked
$\mu \overline{X_i.\varphi_i}$ in $\psi$ (that do not occur under modalities)
strictly decreases after each expansion.

The notation $\varphi \mathrel{\dot\in} t$ is often used if there are some
$T, F$ such that $\varphi \mathrel{\dot\in} t \Longrightarrow (T, F)$. A
formula $\varphi$ is true at a type $t$ iff $\varphi \mathrel{\dot\in} t$.
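
The truth status of a formula at a type can be computed by a direct recursion
on the formula. The sketch below is an illustration under several
assumptions: it returns only a boolean instead of the pair $(T, F)$, it
handles formulas in negation normal form, formulas use a hypothetical tuple
encoding, and when unwinding it substitutes each variable by the fixpoint with
that variable's definition as body (which has the same interpretation as the
fixpoint with the bare variable as body) so that unwinding always makes
progress. It only terminates on formulas whose variables occur under a
modality.

```python
def substitute(phi, env):
    """Replace free variables of phi according to env, respecting binders."""
    kind = phi[0]
    if kind == "var":
        return env.get(phi[1], phi)
    if kind in ("or", "and"):
        return (kind, substitute(phi[1], env), substitute(phi[2], env))
    if kind == "dia":
        return ("dia", phi[1], substitute(phi[2], env))
    if kind == "mu":
        bindings, body = phi[1], phi[2]
        bound = {x for x, _ in bindings}
        inner = {x: v for x, v in env.items() if x not in bound}
        return ("mu",
                tuple((x, substitute(d, inner)) for x, d in bindings),
                substitute(body, inner))
    return phi                                   # constants and atoms

def expand(phi):
    """Unwind a fixpoint once."""
    _, bindings, body = phi
    env = {x: ("mu", bindings, d) for x, d in bindings}
    return substitute(body, env)

def holds(phi, t):
    """True when phi is true at the type t (a set of lean formulas)."""
    kind = phi[0]
    if kind == "top":
        return True
    if kind in ("prop", "start", "dia"):         # formulas of the lean
        return phi in t
    if kind == "nprop":
        return ("prop", phi[1]) not in t
    if kind == "nstart":
        return ("start",) not in t
    if kind == "ndia":
        return ("dia", phi[1], ("top",)) not in t
    if kind == "and":
        return holds(phi[1], t) and holds(phi[2], t)
    if kind == "or":
        return holds(phi[1], t) or holds(phi[2], t)
    if kind == "mu":
        return holds(expand(phi), t)
    raise ValueError(f"unknown formula kind: {kind!r}")

# phi = mu X.(b /\ start) \/ <2>X   and   psi = a /\ <1>phi
BODY = ("or", ("and", ("prop", "b"), ("start",)), ("dia", 2, ("var", "X")))
PHI = ("mu", (("X", BODY),), BODY)
PSI = ("and", ("prop", "a"), ("dia", 1, PHI))
LEAF = frozenset({("prop", "b"), ("start",)})
ROOT = frozenset({("prop", "a"), ("dia", 1, ("top",)), ("dia", 1, PHI)})
```

Here `PHI` is true at the leaf type and `PSI` is true at `ROOT`, while `PSI`
is false at the leaf because its label is not `a`.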

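The truth status of a formula is now related to the truth assignment of
its $\psi$-types.

Proposition 6.2.1 If $\varphi \mathrel{\dot\in} t \Longrightarrow (T, F)$, then
$T \subseteq t$, $F \subseteq Lean(\psi) \setminus t$, and
$\bigwedge_{\psi \in T} \psi \wedge \bigwedge_{\psi \in F} \neg\psi \Longrightarrow \varphi$.
If $\varphi \mathrel{\dot\notin} t \Longrightarrow (T, F)$, then
$T \subseteq t$, $F \subseteq Lean(\psi) \setminus t$, and
$\bigwedge_{\psi \in T} \psi \wedge \bigwedge_{\psi \in F} \neg\psi \Longrightarrow \neg\varphi$.

Proof outline: Immediate by induction on the derivations. □

A compatibility relation is now defined between types. This relation
establishes which formulas must hold in a type in order for it to be a witness
for a modal formula.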

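Definition 6.2.2 (Compatibility relation) Two types $t, t'$ are compatible
under $a \in \{1, 2\}$, written $\Delta_a(t, t')$, iff

\[
\begin{aligned}
&\forall \langle a\rangle\varphi \in Lean(\psi),\ \langle a\rangle\varphi \in t \Leftrightarrow \varphi \mathrel{\dot\in} t' \\
&\forall \langle\bar a\rangle\varphi \in Lean(\psi),\ \langle\bar a\rangle\varphi \in t' \Leftrightarrow \varphi \mathrel{\dot\in} t
\end{aligned}
\]

\[
\frac{\varphi \in Lean(\psi) \quad \varphi \in t}{\varphi \mathrel{\dot\in} t \Longrightarrow (\{\varphi\}, \emptyset)}
\qquad
\frac{\varphi \in Lean(\psi) \quad \varphi \notin t}{\varphi \mathrel{\dot\notin} t \Longrightarrow (\emptyset, \{\varphi\})}
\]
\[
\frac{\varphi_1 \mathrel{\dot\in} t \Longrightarrow (T_1, F_1) \quad \varphi_2 \mathrel{\dot\in} t \Longrightarrow (T_2, F_2)}
     {\varphi_1 \wedge \varphi_2 \mathrel{\dot\in} t \Longrightarrow (T_1 \cup T_2, F_1 \cup F_2)}
\qquad
\frac{\varphi_1 \mathrel{\dot\notin} t \Longrightarrow (T_1, F_1) \quad \varphi_2 \mathrel{\dot\notin} t \Longrightarrow (T_2, F_2)}
     {\varphi_1 \vee \varphi_2 \mathrel{\dot\notin} t \Longrightarrow (T_1 \cup T_2, F_1 \cup F_2)}
\]
\[
\frac{\varphi_1 \mathrel{\dot\in} t \Longrightarrow (T_1, F_1)}{\varphi_1 \vee \varphi_2 \mathrel{\dot\in} t \Longrightarrow (T_1, F_1)}
\qquad
\frac{\varphi_2 \mathrel{\dot\in} t \Longrightarrow (T_2, F_2)}{\varphi_1 \vee \varphi_2 \mathrel{\dot\in} t \Longrightarrow (T_2, F_2)}
\]
\[
\frac{\varphi_1 \mathrel{\dot\notin} t \Longrightarrow (T_1, F_1)}{\varphi_1 \wedge \varphi_2 \mathrel{\dot\notin} t \Longrightarrow (T_1, F_1)}
\qquad
\frac{\varphi_2 \mathrel{\dot\notin} t \Longrightarrow (T_2, F_2)}{\varphi_1 \wedge \varphi_2 \mathrel{\dot\notin} t \Longrightarrow (T_2, F_2)}
\]
\[
\frac{\varphi \mathrel{\dot\notin} t \Longrightarrow (T, F)}{\neg\varphi \mathrel{\dot\in} t \Longrightarrow (T, F)}
\qquad
\frac{\varphi \mathrel{\dot\in} t \Longrightarrow (T, F)}{\neg\varphi \mathrel{\dot\notin} t \Longrightarrow (T, F)}
\]
\[
\frac{exp(\mu \overline{X_i.\varphi_i} \text{ in } \psi) \mathrel{\dot\in} t \Longrightarrow (T, F)}
     {\mu \overline{X_i.\varphi_i} \text{ in } \psi \mathrel{\dot\in} t \Longrightarrow (T, F)}
\qquad
\frac{exp(\mu \overline{X_i.\varphi_i} \text{ in } \psi) \mathrel{\dot\notin} t \Longrightarrow (T, F)}
     {\mu \overline{X_i.\varphi_i} \text{ in } \psi \mathrel{\dot\notin} t \Longrightarrow (T, F)}
\]

Figure 6.1: Truth Assignment of a Formula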



6.3 The Algorithm 

The algorithm works on sets of triples of the form $(t, w_1, w_2)$ where $t$ is
a type, and $w_1$ and $w_2$ are sets of types which represent all possible
witnesses for $t$ according to relations $\Delta_1$ and $\Delta_2$.

The algorithm proceeds in a bottom-up approach, repeatedly adding new
triples until a satisfying model is found (i.e. a triple whose first component
is a type implying the formula), or until no more triples can be added. Each
iteration of the algorithm builds types representing deeper trees (in the 1
and 2 directions) with pending backward modalities that will be fulfilled at
later iterations. Types with no backward modalities are satisfiable, and if
such a type implies the formula being tested, then the formula is
satisfiable. The main iteration is as follows:

    X ← ∅
    repeat
        X' ← X
        X ← Upd(X')
        if FinalCheck(ψ, X) then
            return "ψ is satisfiable"
    until X = X'
    return "ψ is unsatisfiable"

where $X \subseteq Typ(\psi) \times 2^{Typ(\psi)} \times 2^{Typ(\psi)}$ and the
operations Upd(·) and FinalCheck(·) are defined in Figure 6.2.
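
The control structure of this iteration can be sketched independently of the
logic. In the sketch below, `upd` and `final_check` are parameters standing
for Upd(·) and FinalCheck(·); they are not implemented here, and the toy
instances used to exercise the loop are purely hypothetical.

```python
def satisfiable(psi, upd, final_check):
    """Generic fixpoint loop: stop on success or when no triple is added."""
    x = frozenset()
    while True:
        x_prev = x
        x = upd(x_prev)                 # add the triples derivable from x_prev
        if final_check(psi, x):
            return True                 # a satisfying model has been found
        if x == x_prev:
            return False                # fixpoint reached without success

def toy_upd(x):
    """Toy monotone update: add the next integer, up to 3."""
    return x | {len(x)} if len(x) < 4 else x

def toy_final_check(psi, x):
    """Toy success test: the target value has been produced."""
    return psi in x
```

With these toy functions, `satisfiable(2, toy_upd, toy_final_check)` succeeds
after three rounds, while `satisfiable(7, toy_upd, toy_final_check)` reaches
the fixpoint {0, 1, 2, 3} and reports failure. Termination relies on `upd`
being monotonic over a finite domain.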

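\[
\begin{aligned}
Upd(X) = X &\cup \{(t, w_1(t, X^{\circ}), w_2(t, X^{\circ})) \mid \circledS \notin t \in Typ(\psi) \\
&\qquad \wedge \langle 1\rangle\top \in t \Rightarrow w_1(t, X^{\circ}) \neq \emptyset \\
&\qquad \wedge \langle 2\rangle\top \in t \Rightarrow w_2(t, X^{\circ}) \neq \emptyset\} \\
&\cup \{(t, w_1(t, X^{\circ}), w_2(t, X^{\circ}))^{\circledS} \mid \circledS \in t \in Typ(\psi) \\
&\qquad \wedge \langle 1\rangle\top \in t \Rightarrow w_1(t, X^{\circ}) \neq \emptyset \\
&\qquad \wedge \langle 2\rangle\top \in t \Rightarrow w_2(t, X^{\circ}) \neq \emptyset\} \\
&\cup \{(t, w_1(t, X^{\circledS}), w_2(t, X^{\circ}))^{\circledS} \mid \circledS \notin t \in Typ(\psi) \\
&\qquad \wedge \langle 1\rangle\top \in t \Rightarrow w_1(t, X^{\circledS}) \neq \emptyset \\
&\qquad \wedge \langle 2\rangle\top \in t \Rightarrow w_2(t, X^{\circ}) \neq \emptyset\} \\
&\cup \{(t, w_1(t, X^{\circ}), w_2(t, X^{\circledS}))^{\circledS} \mid \circledS \notin t \in Typ(\psi) \\
&\qquad \wedge \langle 1\rangle\top \in t \Rightarrow w_1(t, X^{\circ}) \neq \emptyset \\
&\qquad \wedge \langle 2\rangle\top \in t \Rightarrow w_2(t, X^{\circledS}) \neq \emptyset\}
\end{aligned}
\]
\[
\begin{aligned}
w_a(t, X) &= \{type(x) \mid x \in X \wedge \langle\bar a\rangle\top \in type(x) \wedge \Delta_a(t, type(x))\} \\
FinalCheck(\psi, X) &= \exists x \in X^{\circledS},\ dsat(x, \psi) \wedge \forall a \in \{\bar 1, \bar 2\},\ \langle a\rangle\top \notin type(x) \\
dsat((t, w_1, w_2), \psi) &= \psi \mathrel{\dot\in} t \vee \exists x',\ dsat(x', \psi) \wedge (x' \in w_1 \vee x' \in w_2) \\
X^{\circledS} &= \{x \in X \mid x^{\circledS}\} \qquad X^{\circ} = \{x \in X \mid x^{\circ}\} \\
type((t, w_1, w_2)) &= t
\end{aligned}
\]

Figure 6.2: Operations used by the Algorithm.

$X^i$ and $T^i$ respectively denote the set of triples and the set of types
after $i$ iterations: $T^i = \{type(x) \mid x \in X^i\}$. Note that $T^{i+1}$ is
the set of types for which at least one witness belongs to $T^i$.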

6.3.1 Example Run of the Algorithm 

Figure 6.3 illustrates a run of the algorithm for checking the non-emptiness
of the simple XPath expression $e$ = self::b/parent::a. This expression is
first compiled into the logic as explained in Section 5.4.1. The resulting
formula $\psi = E^{\rightarrow}[\![e]\!]_{\top}$ is shown in Figure 6.3
(step 1). As a second step, $Lean(\psi)$ is computed. Then the fixpoint
computation starts: the set of types $T^1$ contains all possible leaves
(step 3). For each type in $T^2 \setminus T^1$, a witness must be found in
$T^1$. The algorithm notably finds a witness for a particular $\psi$-type $t$
such that $a \wedge \langle 1\rangle\varphi \mathrel{\dot\in} t$ (step 5).
$T^2$ finally contains 81 $\psi$-types (step 6). $t$ happens to satisfy the
initial formula $\psi$ (step 7), therefore the algorithm stops just after
computing $T^2$ (step 8) because the structure built by connecting $t$ and its
witness (as drawn in Figure 6.3) is a finite tree which contains a node on
which $\psi$ is satisfied. Thus self::b/parent::a is satisfiable.

[Figure 6.3 (diagram not recoverable). Steps shown:
1) $\psi = a \wedge \langle 1\rangle\varphi$ with
$\varphi = \mu X.(b \wedge \circledS) \vee \langle 2\rangle X$ and
$exp(\varphi) = (b \wedge \circledS) \vee \langle 2\rangle\varphi$;
2) $Lean(\psi) = \{\langle 1\rangle\top, \langle\bar 1\rangle\top, \langle 2\rangle\top, \langle\bar 2\rangle\top, a, b, \sigma_x, \circledS, \langle 1\rangle\varphi, \langle 2\rangle\varphi\}$;
3) computation of $T^1$;
5) a witness is found for a type containing $\langle 1\rangle\varphi$;
6) $|T^2| = 81$;
7) $t$ satisfies $\psi$;
8) return satisfiable.]

Figure 6.3: Run of the Algorithm for Checking Emptiness of self::b/parent::a



6.4 Correctness and Complexity 

In this section the correctness of the satisfiability testing algorithm is
proved, and it is shown that its time complexity is $2^{O(|Lean(\psi)|)}$.

Theorem 6.4.1 (Correctness) The algorithm decides satisfiability of for- 
mulas over finite focused trees. 



Termination For $\psi \in \mathcal{L}_\mu$, since $cl(\psi)$ is a finite set,
$Lean(\psi)$ and $2^{Lean(\psi)}$ are also finite. Furthermore, Upd(·) is
monotonic and each $X^i$ is included in the finite set
$Typ(\psi) \times 2^{Typ(\psi)} \times 2^{Typ(\psi)}$, therefore the algorithm
terminates. To finish the proof, it thus suffices to prove soundness and
completeness.



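Preliminary Definitions for Soundness First, a notion of partial
satisfiability is introduced for a formula. In this partial satisfiability
notion, backward modalities are only checked up to a given level. A formula
$\varphi$ is partially satisfied iff $[\![\varphi]\!]^0_V \neq \emptyset$ as
defined in Figure 6.4.

For a type $t$, $\varphi_c(t)$ denotes the most constrained formula, where
atoms are taken from $Lean(\psi)$. In the following, $\circ$ stands for
$\circledS$ if $\circledS \in t$ and for $\neg\circledS$ otherwise.

\[
\varphi_c(t) = \sigma(t) \wedge \bigwedge_{\sigma \in \Sigma,\, \sigma \notin t} \neg\sigma
\wedge \circ
\wedge \bigwedge_{\langle a\rangle\varphi \in t} \langle a\rangle\varphi
\wedge \bigwedge_{\langle a\rangle\varphi \notin t} \neg\langle a\rangle\varphi
\]

A notion of paths is now introduced. Paths written $p$ are concatenations of
modalities: the empty path is written $\epsilon$, and path concatenation is
written $pa$. Every path may be given a depth:

\[
\begin{aligned}
depth(\epsilon) &= 0 \\
depth(pa) &= depth(p) + 1 \quad \text{if } a \in \{1, 2\} \\
depth(pa) &= depth(p) - 1 \quad \text{if } a \in \{\bar 1, \bar 2\}
\end{aligned}
\]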
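
The depth function is a simple fold over the path. In the sketch below, a path
is encoded as a list of integers, with negative numbers standing for the
converse programs; this encoding and the function name are assumptions made
for the example.

```python
def depth(path):
    """Depth of a path: +1 per forward step (1, 2), -1 per backward step."""
    total = 0
    for a in path:
        if a in (1, 2):
            total += 1
        elif a in (-1, -2):
            total -= 1
        else:
            raise ValueError(f"unknown program: {a!r}")
    return total
```

For instance, the path $1\,2\,\bar 2$ has depth 1, and a path made only of
backward steps has a negative depth.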
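\[
\begin{aligned}
[\![\sigma]\!]^n_V &= \{f \mid nm(f) = \sigma\} \\
[\![\neg\sigma]\!]^n_V &= \{f \mid nm(f) \neq \sigma\} \\
[\![\circledS]\!]^n_V &= \{f \mid f = (\sigma^{\circledS}[tl], c)\} \\
[\![\langle\bar 1\rangle\varphi]\!]^0_V &= \mathcal{F} \qquad [\![\langle\bar 2\rangle\varphi]\!]^0_V = \mathcal{F} \\
[\![\langle\bar 1\rangle\varphi]\!]^{n>0}_V &= \{f\langle 1\rangle \mid f \in [\![\varphi]\!]^{n-1}_V \wedge f\langle 1\rangle \text{ defined}\} \\
[\![\langle\bar 2\rangle\varphi]\!]^{n>0}_V &= \{f\langle 2\rangle \mid f \in [\![\varphi]\!]^{n-1}_V \wedge f\langle 2\rangle \text{ defined}\} \\
[\![\langle 1\rangle\varphi]\!]^n_V &= \{f\langle\bar 1\rangle \mid f \in [\![\varphi]\!]^{n+1}_V \wedge f\langle\bar 1\rangle \text{ defined}\} \\
[\![\langle 2\rangle\varphi]\!]^n_V &= \{f\langle\bar 2\rangle \mid f \in [\![\varphi]\!]^{n+1}_V \wedge f\langle\bar 2\rangle \text{ defined}\} \\
[\![\neg\langle a\rangle\top]\!]^n_V &= \{f \mid f\langle a\rangle \text{ undefined}\} \\
[\![\mu \overline{X_i.\varphi_i} \text{ in } \psi]\!]^n_V &= \text{let } T_i = \Big(\bigcap \{T_i \subseteq \mathcal{F} \mid [\![\varphi_i]\!]^n_{V[\overline{T_i/X_i}]} \subseteq T_i\}\Big)_i \text{ in } [\![\psi]\!]^n_{V[\overline{T_i/X_i}]}
\end{aligned}
\]

Figure 6.4: Partial Satisfiability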



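A forward path is a path that only mentions forward modalities.

A tree of types $\mathcal{T}$ is defined as a tree whose nodes are types,
$\mathcal{T}\langle\bullet\rangle = t$, with at most two children,
$\mathcal{T}\langle 1\rangle$ and $\mathcal{T}\langle 2\rangle$. The navigation
in a tree of types is trivially extended to forward paths. A tree of types is
consistent iff for every forward path $p$ and for every child $a$ of
$\mathcal{T}\langle p\rangle$, the following holds:
$\mathcal{T}\langle p\rangle\langle\bullet\rangle = t$ and
$\mathcal{T}\langle pa\rangle\langle\bullet\rangle = t'$ implies
$\langle a\rangle\top \in t$, $\langle\bar a\rangle\top \in t'$, and
$\Delta_a(t, t')$.

Given a consistent tree of types $\mathcal{T}$, a dependency graph is now
defined. In this graph, nodes are pairs of a forward path $p$ and a formula in
$t = \mathcal{T}\langle p\rangle\langle\bullet\rangle$ or the negation of a
formula in the complement of $t$. The directed edges of the graph are
modalities consistent with the tree. For every $(p, \varphi)$ in the nodes the
following edges are built:

• $\varphi \in \Sigma(\psi) \cup \neg\Sigma(\psi) \cup \{\circledS, \neg\circledS, \langle a\rangle\top, \neg\langle a\rangle\top\}$:
no edge

• $p = \epsilon$, $\varphi = \langle a\rangle\varphi'$ with
$a \in \{\bar 1, \bar 2\}$: no edge

• $p = p'a$, $\varphi = \langle a'\rangle\varphi'$: let
$t = \mathcal{T}\langle p\rangle\langle\bullet\rangle$. First consider the case
where $a' \in \{1, 2\}$ and let
$t' = \mathcal{T}\langle pa'\rangle\langle\bullet\rangle$. As $\mathcal{T}$ is
consistent, $\varphi' \mathrel{\dot\in} t'$, hence there are $T, F$ such that
$\varphi' \mathrel{\dot\in} t' \Longrightarrow (T, F)$ with $T$ a subset of
$t'$, and $F$ a subset of the complement of $t'$. For every $\varphi_T \in T$
an edge $a'$ is added to $(pa', \varphi_T)$, and for every $\varphi_F \in F$ an
edge $a'$ is added to $(pa', \neg\varphi_F)$. Consider now the case where
$a' \in \{\bar 1, \bar 2\}$ and first show that $a' = \bar a$. As
$\mathcal{T}$ is consistent, $\langle\bar a\rangle\top$ is in $t$. Moreover, as
$t$ is a tree type, it must contain $\langle a'\rangle\top$. As $a'$ is a
backward modality, it must be equal to $\bar a$ as at most one may be present.
Hence $p'aa' = p'$ holds. Let
$t' = \mathcal{T}\langle p'\rangle\langle\bullet\rangle$. By consistency,
$\varphi' \mathrel{\dot\in} t'$, hence
$\varphi' \mathrel{\dot\in} t' \Longrightarrow (T, F)$ and edges are added as
in the previous case: to $(p', \varphi_T)$ and to $(p', \neg\varphi_F)$.

• $p = p'a$, $\varphi = \neg\langle a'\rangle\varphi'$: let
$t = \mathcal{T}\langle p\rangle\langle\bullet\rangle$. If
$\langle a'\rangle\top$ is not in $t$ then no edge is added. Otherwise, one
proceeds as in the previous case. For downward modalities, let
$t' = \mathcal{T}\langle pa'\rangle\langle\bullet\rangle$ and compute
$\varphi' \mathrel{\dot\notin} t' \Longrightarrow (T, F)$, which is known to
hold by consistency. Edges are then added to $(pa', \varphi_T)$ and to
$(pa', \neg\varphi_F)$ as before. For upward modalities, as
$\langle a'\rangle\top$ holds in $t$, one must have $a' = \bar a$ and let
$t' = \mathcal{T}\langle p'\rangle\langle\bullet\rangle$.
$\varphi' \mathrel{\dot\notin} t' \Longrightarrow (T, F)$ is computed and edges
are added to $(p', \varphi_T)$ and to $(p', \neg\varphi_F)$ as before.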

Lemma 6.4.2 The dependency graph of a consistent tree of types of a cycle- 
free formula is cycle free. 

Proof outline: The proof proceeds by induction on the depth of the cycle,
relying on the fact that the dependency graph is consistent with the tree
structure (i.e. if a $1$ edge reaches a node, no $\bar 2$ edge may leave this
node). The induction case is trivial: if there is a cycle of depth $n$, there
must be a cycle of depth $n - 1$, a contradiction.

The base case is for a cycle of depth 1. One case is described, where the
cycle is
$(p, \langle 1\rangle\varphi) \to_{1} (p1, \langle\bar 1\rangle\psi) \to_{\bar 1} (p, \langle 1\rangle\varphi)$.
As $\varphi$ must be a subformula of $\psi$ and $\psi$ a subformula of
$\varphi$, they are both recursive formulas. An analysis of the shape of
$\varphi$, based on the derivations
$\varphi \mathrel{\dot\in} t \Longrightarrow (T, F)$ and
$\psi \mathrel{\dot\in} t' \Longrightarrow (T', F')$ with
$\langle\bar 1\rangle\psi \in T$ and $\langle 1\rangle\varphi \in T'$, then
shows that $\varphi$ is not a cycle-free formula, a contradiction. □



Lemma 6.4.3 (Soundness) Let $T$ be the result set of the algorithm. For any
type $t \in T$ and any $\varphi$ such that $\varphi \mathrel{\dot\in} t$, then
$[\![\varphi]\!]^0_\emptyset \neq \emptyset$.

Proof outline:

The proof proceeds by induction on the number of steps of the algorithm.
For every $t$ in $T^n$ and every witness tree $\mathcal{T}$ rooted at $t$ built
from $X^n$, one can show that $\mathcal{T}$ is a consistent tree type and one
can build a focused tree $f$ that is rooted (i.e. of the shape
$(\sigma^{\circ}[tl], (\epsilon, Top, tl'))$). The tree $f$ is in the partial
interpretation of $\varphi_c(t)$:
$f\langle p\rangle \in [\![\varphi_c(\mathcal{T}\langle p\rangle\langle\bullet\rangle)]\!]^{depth(p)}_\emptyset$
for any path $p$ whose depth is 0 or more, and $f$ contains the context marker
only if $\circledS$ occurs in $\mathcal{T}$. Then one shows that for all
$\varphi \mathrel{\dot\in} t$, $f \in [\![\varphi]\!]^0_\emptyset$ holds.

The base case is trivial by the shape of $t$: it may only contain backward
modalities (trivially satisfied at level 0), one atomic proposition, and one
context proposition. Moreover there is only one tree of witnesses to consider,
the tree whose only node is $t$. If the atomic proposition is $\sigma$, then
the focused tree returned is either
$(\sigma^{\circledS}[\epsilon], (\epsilon, Top, \epsilon))$ or
$(\sigma[\epsilon], (\epsilon, Top, \epsilon))$ depending on the context
proposition.

In the inductive case, all witness types for both downward modalities, $t_1$
and $t_2$, are considered. For each of them, all tree types $\mathcal{T}_1$ and
$\mathcal{T}_2$ are considered and a tree type rooted at $t$ is built, which is
consistent by definition of the algorithm. By induction, there are $f_1$ and
$f_2$ such that
$f_1\langle p\rangle \in [\![\varphi_c(\mathcal{T}\langle 1p\rangle\langle\bullet\rangle)]\!]^{depth(p)}_\emptyset$
and
$f_2\langle p\rangle \in [\![\varphi_c(\mathcal{T}\langle 2p\rangle\langle\bullet\rangle)]\!]^{depth(p)}_\emptyset$
for any path $p$ whose depth is 0 or more. If either $\mathcal{T}_1$ or
$\mathcal{T}_2$ contains $\circledS$, then $f_1$ or $f_2$ contains the context
marker by induction. Moreover, by definition of the algorithm, it is the case
for only one of them and $\circledS$ is not in $t$.

Let $f_1$ be $(\sigma_1^{\circ}[tl_1], (\epsilon, Top, tr_1))$ and $f_2$ be
$(\sigma_2^{\circ}[tl_2], (\epsilon, Top, tr_2))$. Let
$f = (\sigma(t)^{\circ}[\sigma_1^{\circ}[tl_1] :: tr_1], (\epsilon, Top, \sigma_2^{\circ}[tl_2] :: tr_2))$
where $\sigma(t)^{\circ}$ is $\sigma(t)^{\circledS}$ if $\circledS \in t$, and
$\sigma(t)$ otherwise. Note that $f$ contains exactly one context marker iff
$\circledS \in \mathcal{T}$.

Next, one shows that
$f_1\langle p\rangle \in [\![\varphi_c(\mathcal{T}\langle 1p\rangle\langle\bullet\rangle)]\!]^{depth(p)}_\emptyset$
implies
$f\langle 1p\rangle \in [\![\varphi_c(\mathcal{T}\langle 1p\rangle\langle\bullet\rangle)]\!]^{depth(1p)}_\emptyset$,
and the same for the other modality, by induction on the depth of the path,
remarking that every backward modality at level 0 is trivially satisfied.

Then one proceeds to show that $f$ satisfies $\varphi_c(t)$ at level 0. To do
so, a further induction on the dependency tree is needed. Let $p$ be a path of
the dependency tree and $\psi$ be a formula at that path in the dependency
tree; one shows that
$f\langle p\rangle \in [\![\psi]\!]^{depth(p)}_\emptyset$. To do so, one relies
on
$f\langle p\rangle \in [\![\varphi_c(\mathcal{T}\langle p\rangle\langle\bullet\rangle)]\!]^{depth(p)}_\emptyset$
if $depth(p) \neq 0$. In the base case at depth 0, the result is by
construction as the formula is either a backward modality or an atomic
formula. In the base case at another depth, the case is immediate by induction
as the formula has to be an atomic formula whose interpretation does not
depend on the depth. In the induction case, one concludes by the inductive
hypothesis and by definition of partial satisfiability.

The proof is concluded by noticing that the final selected type has no
backward modality, hence
$[\![\varphi_c(t)]\!]^0_\emptyset = [\![\varphi_c(t)]\!]_\emptyset$.

□ 

Lemma 6.4.4 (Completeness) For a cycle-free closed formula
$\varphi \in \mathcal{L}_\mu$, if $[\![\varphi]\!]_\emptyset \neq \emptyset$
then the algorithm terminates with a set of triples $X$ such that
FinalCheck($\varphi$, $X$).

Proof outline: Let $f \in [\![\varphi]\!]_\emptyset$ be a smallest focused tree
validating the formula such that the names occurring in $f$ are either also
occurring in $\varphi$ or are a single other name $\sigma_x$. By Lemma 5.3.2,
there is a finite unfolding of $\varphi$ such that $f$ belongs to its
interpretation. Hence there is a finite satisfiability derivation, defined in
Figure 6.5, of $f \Vdash_\epsilon \varphi$.

In the satisfiability derivation, paths are assumed to be normalized
($1\bar 1 = \epsilon$). Hence every path is a concatenation of a (possibly
empty) backward path $p_b$, followed by a forward path $p_f$.

This derivation has the following property, immediate by induction: let $f$ be
the initial focused tree, then $f' \Vdash_p \varphi$ implies
$f' = f\langle p\rangle$. Hence if $f_1 \Vdash_p \varphi_1$ and
$f_2 \Vdash_p \varphi_2$, then $f_1 = f_2$.

Next, one uses the satisfiability derivation to construct a run of the algo- 
rithm that concludes that ip is satisfiable. One first associates each path to a 
type, which one then saturates (adding formulas that are true even though the 
satisfiability relation does not mention them at that path). One next shows 
that every formula at a path in the satisfiability relation is implied by the type 
at that path, and that types are consistent according to the $\Delta_a(t, t')$ relation.
One then concludes that the types are created by a run of the algorithm by 
induction on the paths. 

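More precisely, let us first describe how $t_p$ is built. Let $\Phi_p$ be the
set of formulas at path $p$. One first adds every formula of $\Phi_p$ that is
in $Lean(\varphi)$, then one completes this set to yield a correct type: if
$\langle a\rangle\varphi \in \Phi_p$ then one adds $\langle a\rangle\top$; for
every modality $a$ for which $f\langle a\rangle$ is defined one adds
$\langle a\rangle\top$; if there is no atomic proposition in $\Phi_p$ then one
adds $nm(f\langle p\rangle)$; finally if $f\langle p\rangle$ has the context
marker one adds $\circledS$.

\[
\frac{}{f \Vdash_p \top}
\qquad
\frac{nm(f) = \sigma}{f \Vdash_p \sigma}
\qquad
\frac{nm(f) \neq \sigma}{f \Vdash_p \neg\sigma}
\qquad
\frac{}{(\sigma^{\circledS}[tl], c) \Vdash_p \circledS}
\]
\[
\frac{f \Vdash_p \varphi}{f \Vdash_p \varphi \vee \psi}
\qquad
\frac{f \Vdash_p \psi}{f \Vdash_p \varphi \vee \psi}
\qquad
\frac{f \Vdash_p \varphi \quad f \Vdash_p \psi}{f \Vdash_p \varphi \wedge \psi}
\]
\[
\frac{f\langle 1\rangle \Vdash_{p1} \varphi}{f \Vdash_p \langle 1\rangle\varphi}
\qquad
\frac{f\langle 2\rangle \Vdash_{p2} \varphi}{f \Vdash_p \langle 2\rangle\varphi}
\qquad
\frac{f\langle\bar 1\rangle \Vdash_{p\bar 1} \varphi}{f \Vdash_p \langle\bar 1\rangle\varphi}
\qquad
\frac{f\langle\bar 2\rangle \Vdash_{p\bar 2} \varphi}{f \Vdash_p \langle\bar 2\rangle\varphi}
\]
\[
\frac{f\langle a\rangle \text{ undefined}}{f \Vdash_p \neg\langle a\rangle\top}
\qquad
\frac{f \Vdash_p exp(\mu \overline{X_i.\varphi_i} \text{ in } \psi)}{f \Vdash_p \mu \overline{X_i.\varphi_i} \text{ in } \psi}
\]

Figure 6.5: Satisfiability Relation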

One next saturates the types. For every path $p$, if $t_{pa}$ exists, if $\langle a\rangle\psi \in$
Lean($\varphi$), and if $\psi \mathrel{\dot\in} t_{pa}$ then one adds $\langle a\rangle\psi$ to $t_p$. This procedure is repeated
until it does not change any type. Termination is a consequence of the finite
size of the lean and of the number of paths. The resulting types are satisfiable
as they are before saturation (since a focused tree satisfies them) and each
formula added during saturation is first checked to be implied by the type.

One next shows (*): for any given path $p$, if $\varphi_p \in \Phi_p$ then $\varphi_p \mathrel{\dot\in} t_p$, by
induction on the satisfiability derivation. Base cases with no negation are
immediate by definition of $t_p$ as these are formulas of the lean. For base cases
with negation, one relies on the fact that $f\langle p\rangle$ satisfies the formula, hence one
cannot for instance have $\sigma$ and $\neg\sigma$ in $\Phi_p$. If $\neg\langle a\rangle\top \in \Phi_p$ then one cannot
also have $\langle a\rangle\varphi \in \Phi_p$ as $pa$ is not a valid path, hence $\langle a\rangle\top$ is not in $t_p$ thus
$\neg\langle a\rangle\top \mathrel{\dot\in} t_p$. The inductive cases of this induction (disjunction, conjunction,
recursion) are immediate as they correspond to the definition of $\cdot \mathrel{\dot\in} \cdot$.

One next shows that for every type $t_p$ and $t_{pa}$ where $a$ is a forward modality,
$\langle\bar{a}\rangle\top \in t_{pa}$ and $\Delta_a(t_p, t_{pa})$ hold. (Note that, by path normalization, the
types considered may be $t_{p\bar{2}}$ and $t_p$ for modality 2.) The first condition is
immediate by construction of $t_{pa}$ as $f\langle pa\rangle$ is defined. For the second condition,
let $\langle a\rangle\psi \in t_p$. If $\langle a\rangle\psi \in \Phi_p$, then it occurs in the satisfiability derivation with
a hypothesis $f_{pa} \Vdash_{pa} \psi$. In this case $\psi \mathrel{\dot\in} t_{pa}$ holds by (*). If $\langle a\rangle\psi \notin \Phi_p$ then
it was added during saturation and the result is immediate by construction.
Conversely, if $\psi \mathrel{\dot\in} t_{pa}$ then by saturation $\langle a\rangle\psi \in t_p$. The case $\langle\bar{a}\rangle\psi \in t_{pa}$
is now considered. The proof goes exactly as before, distinguishing the case
where the formula is in $\Phi_{pa}$ and the case where it was added by saturation.

One now shows that there is a run of the algorithm that produces these
types. The proof proceeds by induction on the paths in the downward direction;
if $t_{pa}$ has been proven for a partial run for $a \in \{1,2\}$, then $t_p$ is proven for
the next step of the algorithm. Moreover, one shows that $(t_p, \{t_{p1}\}, \{t_{p2}\})$ is
marked iff a forward subtree of $f\langle p\rangle$ contains the context mark. The base case
is for paths with no descendants, hence no witness is required. The algorithm
then adds $(t_p, \emptyset, \emptyset)$ to its set of types, with a mark iff ⓢ $\in t_p$, iff $f\langle p\rangle$ is marked.

The inductive case is now considered. By induction, a partial run of the
algorithm returns $t_{p1}$ and/or $t_{p2}$. One first shows that $t_p$ is returned in the next
step of the algorithm, taking these two types as witnesses. One first remarks
that if either witness is marked then the other is not and the mark is not at
$f\langle p\rangle$, since there is only one context mark in $f$, and if the mark is at $f\langle p\rangle$,
then neither witness is marked. For each child $a \in \{1,2\}$, $\Delta_a(t_p, t_{pa})$ and
$\langle\bar{a}\rangle\top \in t_{pa}$, hence the triple $(t_p, W_1, W_2)$ with $t_{p1} \in W_1$ and $t_{p2} \in W_2$ is added
by the algorithm.

One may now conclude. At the end of the induction, the last path considered,
$p_0$, has no predecessor, hence it is the longest backward only path. Since
$f\langle p_0\rangle$ is the root of the tree, $\langle\bar{1}\rangle\top \notin t_{p_0}$ and $\langle\bar{2}\rangle\top \notin t_{p_0}$. Moreover, as the
context mark is somewhere in $f$, it is in a forward subtree of $f\langle p_0\rangle$, hence the
final type is marked. Finally, $t_\epsilon$ is in the witness tree of the final type, and
since $f \Vdash_\epsilon \varphi$, $\varphi \mathrel{\dot\in} t_\epsilon$. □

Lemma 6.4.5 (Complexity) For a formula $\psi \in \mathcal{L}_\mu$ the satisfiability problem
$[\![\psi]\!]_\emptyset \neq \emptyset$ is decidable in time $2^{O(n)}$ where $n = |\mathrm{Lean}(\psi)|$.

Proof outline: $|\mathrm{Typ}(\psi)|$ is bounded by $|2^{\mathrm{Lean}(\psi)}|$ which is $2^{O(n)}$. During each
iteration, the algorithm adds at least one new type (otherwise it terminates),
thus it performs at most $2^{O(n)}$ iterations. What is done at each iteration is now
detailed. For each type that may be added (there are $2^{O(n)}$ of them), there are
two traversals of the set of types at the previous step to collect witnesses. Hence
there are $2 \cdot 2^{O(n)} \cdot 2^{O(n)} = 2^{O(n)}$ witness tests at each iteration. Each witness
test involves a membership test and a $\Delta_a$ test. In the implementation these
are precomputed: for every formula $\langle a\rangle\varphi$ in the lean, the subsets $(T, F)$ of the
lean that must be true and false respectively for $\varphi$ to be true are precomputed,
so testing $\varphi \mathrel{\dot\in} t$ amounts to simple inclusion and disjunction tests. The FinalCheck
condition tests at most $2^{O(n)}$ $\psi$-types and each test takes at most $2^{O(n)}$ (testing
the formulas containing ⓢ against $\psi$). Therefore, the worst case global time
complexity of the algorithm does not exceed $2^{O(n)}$. □
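The counting argument above can be restated in a single line. This is only a summary of the bounds already given, written under two labeled assumptions: $c\,n$ stands for the exponent hidden in each $2^{O(n)}$, and $\mathrm{poly}(n)$ stands for the cost of one precomputed witness test, which the text describes only as simple set tests.

```latex
\[
\underbrace{2^{cn}}_{\text{iterations}}
\cdot
\underbrace{2 \cdot 2^{cn} \cdot 2^{cn}}_{\text{witness tests per iteration}}
\cdot\; \mathrm{poly}(n)
\;+\;
\underbrace{2^{cn} \cdot 2^{cn}}_{\text{FinalCheck}}
\;=\; 2^{O(n)}
\]
```

A product of a constant number of single-exponential factors remains single exponential, which is why the overall bound does not leave $2^{O(n)}$.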

6.5 Implementation Techniques 

This section describes the main techniques used in the complete implementation
[Genevès, 2006] of the $\mathcal{L}_\mu$ decision procedure.

6.5.1 Implicit Representation of Sets of $\psi$-Types

The implementation relies on a symbolic representation and manipulation of
sets of types using Binary Decision Diagrams (BDDs) [Bryant, 1986]. BDDs
provide a canonical representation of boolean functions. Experience has shown
that this representation is very compact for very large boolean functions. Their
effectiveness is notably well known in the area of formal verification of systems
[Edmund M. Clarke et al., 1999].

First, one may observe that the implementation can avoid keeping track of
every possible witness of each $\psi$-type. In fact, for a formula $\varphi$, one can test
$[\![\varphi]\!]_\emptyset \neq \emptyset$ by testing the satisfiability of the (linear-size) "plunging" formula
$\psi = \mu X.\varphi \vee \langle 1\rangle X \vee \langle 2\rangle X$ at the root of focused trees. That is, checking
$[\![\psi]\!]_\emptyset \neq \emptyset$ while ensuring there is no unfulfilled upward eventuality at top level
0. One advantage of proceeding this way is that the implementation only needs
to deal with a current set of $\psi$-types at each step.

A bit-vector representation of $\psi$-types is now introduced. Types are complete
in the sense that either a subformula or its negation must belong to a
type. It is thus possible for a formula $\varphi \in \mathrm{Lean}(\psi)$ to be represented using
a single BDD variable. For $\mathrm{Lean}(\psi) = \{\varphi_1, ..., \varphi_m\}$, a subset $t \subseteq \mathrm{Lean}(\psi)$ is
represented by a vector $\vec{t} = (t_1, ..., t_m) \in \{0, 1\}^m$ such that $\varphi_i \in t$ iff $t_i = 1$. A
BDD with $m$ variables is then used to represent a set of such bit vectors.

For a program $a \in \{1,2\}$, some auxiliary predicates on a vector $\vec{t}$ are defined:

• $\mathrm{isparent}_a(\vec{t})$ is read "$\vec{t}$ is a parent for program $a$" and is true iff the bit
for $\langle a\rangle\top$ is true in $\vec{t}$

• $\mathrm{ischild}_a(\vec{t})$ is read "$\vec{t}$ is a child for program $a$" and is true iff the bit for
$\langle\bar{a}\rangle\top$ is true in $\vec{t}$
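As an illustration of this encoding, here is a minimal Python sketch that uses explicit tuples where the implementation uses BDD variables. The contents of the lean, the string spelling of formulas (`<1>T` for $\langle 1\rangle\top$, `<-1>T` for $\langle\bar{1}\rangle\top$) and every function name are assumptions made for the example; they are not taken from the solver.

```python
from typing import FrozenSet, Sequence, Tuple

BitVector = Tuple[int, ...]

def encode(lean: Sequence[str], t: FrozenSet[str]) -> BitVector:
    """Bit i is 1 iff the i-th lean formula belongs to the subset t."""
    extra = set(t) - set(lean)
    if extra:
        raise ValueError(f"not lean formulas: {sorted(extra)}")
    return tuple(1 if phi in t else 0 for phi in lean)

def decode(lean: Sequence[str], bits: BitVector) -> FrozenSet[str]:
    """Inverse of encode: recover the subset of the lean from its bits."""
    return frozenset(phi for phi, bit in zip(lean, bits) if bit)

def is_parent(lean: Sequence[str], bits: BitVector, a: int) -> bool:
    """True iff the bit for <a>T is set (the node has an a-successor)."""
    return bits[lean.index(f"<{a}>T")] == 1

def is_child(lean: Sequence[str], bits: BitVector, a: int) -> bool:
    """True iff the bit for the converse <-a>T is set."""
    return bits[lean.index(f"<-{a}>T")] == 1

# Hypothetical lean, fixed in an arbitrary order for the example.
LEAN = ("<1>T", "<2>T", "<-1>T", "<-2>T", "a", "b")
bits = encode(LEAN, frozenset({"<1>T", "a"}))
```

Fixing the order of the lean once is what makes a subset and its vector interchangeable; in the BDD setting the same order becomes the variable order.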



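For a set $T \subseteq 2^{\mathrm{Lean}(\psi)}$, its corresponding characteristic function is denoted
$\chi_T$. Encoding $\chi_{\mathrm{Typ}(\psi)}$ is straightforward with the previous definitions.
The equivalent of $\dot\in$ is defined on the bit vector representation:

\[
\mathrm{status}_\varphi(\vec{t}) =
\begin{cases}
t_i & \text{if } \varphi \in \mathrm{Lean}(\psi) \\
\mathrm{status}_{\varphi'}(\vec{t}) \wedge \mathrm{status}_{\varphi''}(\vec{t}) & \text{if } \varphi = \varphi' \wedge \varphi'' \\
\mathrm{status}_{\varphi'}(\vec{t}) \vee \mathrm{status}_{\varphi''}(\vec{t}) & \text{if } \varphi = \varphi' \vee \varphi'' \\
\neg\,\mathrm{status}_{\varphi'}(\vec{t}) & \text{if } \varphi = \neg\varphi' \\
\mathrm{status}_{\exp(\varphi)}(\vec{t}) & \text{if } \varphi = \mu X_i.\varphi_i \text{ in } \psi
\end{cases}
\]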
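The case analysis of the status function can be sketched as a small recursive evaluator. The formula representation below (a string for a lean formula, tagged tuples for connectives) and the `expand` callback standing for $\exp(\cdot)$ are hypothetical choices for the example; the sketch assumes that expansion eventually exposes only lean formulas, so that the recursion terminates.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple, Union

# A formula is either a lean formula (a string, looked up as one bit)
# or a tagged tuple: ("and", f, g), ("or", f, g), ("not", f), ("mu", ...).
Formula = Union[str, Tuple]

def status(phi: Formula,
           index: Dict[str, int],
           bits: Sequence[int],
           expand: Optional[Callable[[Formula], Formula]] = None) -> bool:
    """Evaluate phi on a bit vector, following the definition case by case."""
    if isinstance(phi, str):                      # phi is in the lean: read its bit
        return bool(bits[index[phi]])
    op = phi[0]
    if op == "and":
        return status(phi[1], index, bits, expand) and status(phi[2], index, bits, expand)
    if op == "or":
        return status(phi[1], index, bits, expand) or status(phi[2], index, bits, expand)
    if op == "not":
        return not status(phi[1], index, bits, expand)
    if op == "mu":                                # unfold the fixpoint once, then recurse
        if expand is None:
            raise ValueError("an expand function is required for fixpoint formulas")
        return status(expand(phi), index, bits, expand)
    raise ValueError(f"unknown connective: {op!r}")

# Hypothetical lean and vector.
INDEX = {"a": 0, "<1>T": 1, "b": 2, "<1>muX": 3}
BITS = (1, 0, 0, 1)
holds = status(("and", "a", ("not", "<1>T")), INDEX, BITS)
```

In the implementation the same recursion builds a BDD over the vector bits instead of returning a boolean for one vector; the structure of the case analysis is unchanged.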



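$a \rightarrow b$ and $a \leftrightarrow b$ respectively denote the implication and equivalence of
two boolean formulas $a$ and $b$ over vector bits. The BDD of the relation $\Delta_a$
for $a \in \{1,2\}$ can now be constructed. This BDD relates all pairs $(\vec{x}, \vec{y})$ that
are consistent w.r.t. the program $a$, i.e., such that $\vec{y}$ supports all of $\vec{x}$'s $\langle a\rangle\varphi$
formulas, and vice-versa $\vec{x}$ supports all of $\vec{y}$'s $\langle\bar{a}\rangle\varphi$ formulas:

\[
\Delta_a(\vec{x}, \vec{y}) = \bigwedge_{1 \le i \le m}
\begin{cases}
x_i \leftrightarrow \mathrm{status}_\varphi(\vec{y}) & \text{if } \varphi_i = \langle a\rangle\varphi \\
y_i \leftrightarrow \mathrm{status}_\varphi(\vec{x}) & \text{if } \varphi_i = \langle\bar{a}\rangle\varphi \\
\top & \text{otherwise}
\end{cases}
\]

For $a \in \{1, 2\}$, the set of witnessed vectors is defined:

\[
\chi_{\mathrm{Wit}_a(T)}(\vec{x}) = \mathrm{isparent}_a(\vec{x}) \rightarrow \exists \vec{y}\; [\, h(\vec{y}) \wedge \Delta_a(\vec{x}, \vec{y}) \,]
\]

where $h(\vec{y}) = \chi_T(\vec{y}) \wedge \mathrm{ischild}_a(\vec{y})$.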

Then, the BDD of the fixpoint computation is initially set to the false
constant, and the main function Upd($\cdot$) is implemented as:

\[
\chi_{\mathrm{Upd}(T)}(\vec{x}) = \chi_T(\vec{x}) \vee \Big( \chi_{\mathrm{Typ}(\psi)}(\vec{x}) \wedge \bigwedge_{a \in \{1,2\}} \chi_{\mathrm{Wit}_a(T)}(\vec{x}) \Big)
\]

Finally, the solver can be implemented as iterations over the sets $\chi_{\mathrm{Upd}(T)}$
until a fixpoint is reached. The final satisfiability condition consists in checking
whether $\psi$ is present in a $\psi$-type of this fixpoint with no unfulfilled upward
eventuality:

\[
\exists \vec{t}\; \Big[\, \chi_T(\vec{t}) \wedge \bigwedge_{a \in \{1,2\}} \neg\,\mathrm{ischild}_a(\vec{t}) \wedge \mathrm{status}_\psi(\vec{t}) \,\Big]
\]



6.5.2 Satisfying Model Reconstruction 

The implementation keeps a copy of each intermediate set of types computed 
by the algorithm, so that whenever a formula is satisfiable, a minimal satisfying 
model can be extracted. The top-down (re)construction of a satisfying model 
starts from a root (a $\psi$-type for which the final satisfiability condition holds),
and repeatedly attempts to find successors. In order to minimize model size, 
only required left and right branches are built. Furthermore, for minimizing the 
maximal depth of the model, left and right successors of a node are successively 
searched in the intermediate sets of types, in the order they were computed 
by the algorithm. For readability purposes, the extracted satisfying model can 
be enriched by annotating the context mark ⓢ from which XPath evaluation
started and a target node selected by the XPath expression. The annotated 
model is then provided to the user in XML unranked tree syntax. 



6.5.3 Conjunctive Partitioning and Early Quantification 

The BDD-based implementation involves computations of relational products 
of the form: 

\[
\exists \vec{y}\; [\, h(\vec{y}) \wedge \Delta_a(\vec{x}, \vec{y}) \,] \qquad (6.1)
\]

It is well-known that such a computation may be quite time and space consuming,
because the BDD corresponding to the relation $\Delta_a$ may be quite large.

One famous optimization technique consists in using conjunctive partitioning
[Edmund M. Clarke et al., 1999] and early quantification [Pan et al., 2006].
The idea is to compute the relational product without ever building the full
BDD of the relation $\Delta_a$. This is possible by taking advantage of the form of
$\Delta_a$ along with properties of existential quantification. By definition, $\Delta_a$ is a
conjunction of $n$ equivalences relating $\vec{x}$ and $\vec{y}$ where $n$ is the number of $\langle b\rangle\varphi$
formulas in Lean($\psi$) where $\varphi \neq \top$ and $b \in \{a, \bar{a}\}$:

\[
\Delta_a(\vec{x}, \vec{y}) = \bigwedge_{i=1}^{n} R_i(\vec{x}, \vec{y})
\]

If a variable $y_k$ does not occur in the clauses $R_{i+1}, ..., R_n$ then the relational
product (6.1) can be rewritten as:

\[
\exists\, y_1, ..., y_{k-1}, y_{k+1}, ..., y_m \Big[\, \exists y_k \Big[ h(\vec{y}) \wedge \bigwedge_{1 \le j \le i} R_j(\vec{x}, \vec{y}) \Big] \wedge \bigwedge_{i+1 \le l \le n} R_l(\vec{x}, \vec{y}) \,\Big]
\]



This makes it possible to apply existential quantification on intermediate BDDs and
thus to compose smaller BDDs. Of course, there are many ways to compose
the $R_i(\vec{x}, \vec{y})$. Let $\rho$ be a permutation of $\{0, ..., n-1\}$ which determines the
order in which the partitions $R_i(\vec{x}, \vec{y})$ are combined. For each $i$, let $D_i$ be the
set of variables $y_k$ with $k \in \{1, ..., m\}$ that $R_i(\vec{x}, \vec{y})$ depends on. $E_i$ is defined
as the set of variables contained in $D_{\rho(i)}$ that are not contained in $D_{\rho(j)}$ for
any $j$ larger than $i$:

\[
E_i = D_{\rho(i)} \setminus \bigcup_{j=i+1}^{n-1} D_{\rho(j)}
\]

The $E_i$ are pairwise disjoint and their union contains all the variables. The
relational product (6.1) can be computed by starting from:

\[
h_1(\vec{x}, \vec{y}) = \exists_{y_k \in E_0}\; [\, h(\vec{y}) \wedge R_{\rho(0)}(\vec{x}, \vec{y}) \,]
\]

and successively computing $h_{p+1}$ defined as follows:

\[
h_{p+1}(\vec{x}, \vec{y}) =
\begin{cases}
\exists_{y_k \in E_p}\; [\, h_p(\vec{x}, \vec{y}) \wedge R_{\rho(p)}(\vec{x}, \vec{y}) \,] & \text{if } E_p \neq \emptyset \\
h_p(\vec{x}, \vec{y}) \wedge R_{\rho(p)}(\vec{x}, \vec{y}) & \text{if } E_p = \emptyset
\end{cases}
\]

until reaching $h_n$ which is the result of the relational product. The ordering
$\rho$ determines how early in the computation variables can be quantified out.
This directly impacts the sizes of the BDDs constructed and therefore the global
efficiency of the decision procedure. It is thus important to choose $\rho$ carefully.
The overall goal is to minimize the size of the largest BDD created during the
elimination process. A heuristic taken from [Edmund M. Clarke et al., 1999]
is used. It seems to provide a good approximation as in practice it yields the
best observed performance. It defines the cost of eliminating a variable $y_k$ as
the sum of the sizes of all the $D_i$ containing $y_k$:

\[
\sum_{1 \le i \le n,\; y_k \in D_i} |D_i|
\]

The ordering $\rho$ on the relations $R_i$ is then defined in such a way that variables
can be eliminated in the order given by a greedy algorithm which repeatedly
eliminates the variable of minimum cost.
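The greedy heuristic can be sketched as follows, working only on the supports $D_i$ (sets of variable names) rather than on BDDs. One step is an assumption of this sketch and is not stated in the text: after a variable is eliminated, the relations that contained it are treated as a single merged cluster whose support is the union of their supports minus that variable, and costs are recomputed on the merged clusters. Ties are broken alphabetically to keep the result deterministic.

```python
from typing import List, Sequence, Set, Tuple

def greedy_order(supports: Sequence[Set[str]]) -> Tuple[List[str], List[int]]:
    """Return (variable elimination order, order in which relations are combined)."""
    # Each cluster is (support, indices of the original relations it contains).
    clusters = [(set(d), [i]) for i, d in enumerate(supports)]
    remaining: Set[str] = set().union(*supports) if supports else set()
    var_order: List[str] = []
    rel_order: List[int] = []

    while remaining:
        def cost(v: str) -> int:
            # Sum of the sizes of all supports that contain v.
            return sum(len(d) for d, _ in clusters if v in d)

        v = min(sorted(remaining), key=cost)
        touched = [c for c in clusters if v in c[0]]
        untouched = [c for c in clusters if v not in c[0]]

        merged_support = set().union(*(d for d, _ in touched)) - {v}
        merged_members = [i for _, members in touched for i in members]
        for i in merged_members:          # relations enter the product when first needed
            if i not in rel_order:
                rel_order.append(i)

        clusters = untouched + [(merged_support, merged_members)]
        var_order.append(v)
        remaining.remove(v)

    # Relations with an empty support are combined last.
    rel_order.extend(i for i in range(len(supports)) if i not in rel_order)
    return var_order, rel_order

# Hypothetical supports D_0, D_1, D_2 over variables y1..y3.
variables, relations = greedy_order([{"y1", "y2"}, {"y2", "y3"}, {"y3"}])
```

Here `y1` has the lowest initial cost because it occurs in a single relation of size two, so it is quantified out first and its relation is combined first.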



6.5.4 BDD Variable Ordering 

The cost of BDD operations is very sensitive to variable ordering. Finding the
optimal variable ordering is known to be NP-complete [Hojati et al., 1996].
However, several heuristics are known to perform relatively well in practice
[Edmund M. Clarke et al., 1999]. Choosing a good initial order of Lean($\psi$)
formulas does significantly improve performance. To this end, preserving locality
of the initial problem happens to be essential. Experience has shown that
the variable order determined by the breadth-first traversal of the formula $\psi$
to solve, which keeps sister subformulas in close proximity, yields better results
in practice.
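A breadth-first ordering of this kind can be sketched in a few lines. The tuple-based formula representation and the `in_lean` predicate are assumptions made for the example; the sketch only shows how a level-by-level traversal places sister subformulas next to each other in the resulting order.

```python
from collections import deque
from typing import Callable, List, Tuple

# Hypothetical AST: ("atom", name), ("not", f), ("and", f, g), ("or", f, g),
# ("dia", program, f) for a modal formula <program> f.
Formula = Tuple

def bfs_variable_order(root: Formula, in_lean: Callable[[Formula], bool]) -> List[Formula]:
    """List lean formulas in the order a breadth-first traversal first meets them."""
    order: List[Formula] = []
    seen = set()
    queue = deque([root])
    while queue:
        f = queue.popleft()
        if f in seen:                 # shared subformulas get a single variable
            continue
        seen.add(f)
        if in_lean(f):
            order.append(f)
        # Only tuple components are subformulas; names and programs are skipped.
        queue.extend(child for child in f[1:] if isinstance(child, tuple))
    return order

formula = ("and",
           ("dia", 1, ("atom", "a")),
           ("or", ("atom", "b"), ("dia", 2, ("atom", "c"))))
order = bfs_variable_order(formula, lambda f: f[0] in ("atom", "dia"))
```

The position of each formula in `order` would then serve as the index of its BDD variable.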



6.6 Typing Applications and Experimental Results 

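For XPath expressions $e_1, ..., e_n \in \mathcal{L}_{XPath}$, the decision problems presented in
Section 4.6 can be generalized in the presence of several XML type expressions
$T_1, ..., T_n$ and formulated as follows:

• XPath containment: $E^{\rightarrow}[\![e_1]\!](\textcircled{s} \wedge [\![T_1]\!]) \wedge \neg E^{\rightarrow}[\![e_2]\!](\textcircled{s} \wedge [\![T_2]\!])$ (if the formula
is unsatisfiable then all nodes selected by $e_1$ under type constraint $T_1$ are
selected by $e_2$ under type constraint $T_2$)

• XPath emptiness: $E^{\rightarrow}[\![e_1]\!](\textcircled{s} \wedge [\![T_1]\!])$

• XPath overlap: $E^{\rightarrow}[\![e_1]\!](\textcircled{s} \wedge [\![T_1]\!]) \wedge E^{\rightarrow}[\![e_2]\!](\textcircled{s} \wedge [\![T_2]\!])$

• XPath coverage: $E^{\rightarrow}[\![e_1]\!](\textcircled{s} \wedge [\![T_1]\!]) \wedge \bigwedge_{2 \le i \le n} \neg E^{\rightarrow}[\![e_i]\!](\textcircled{s} \wedge [\![T_i]\!])$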

Generalizing the previous problem formulations to
distinct types $T_1$ and $T_2$ is particularly useful for applications where types
evolve. For instance, it is common that a file format of some company (described
by an XML schema for instance) evolves over time. In this case, transformations
that operated on the old document type must be updated to operate
on the new type. Analysing the XPath queries of a transformation (written in
XSLT for instance) under different type constraints (the old one and the new
one) can help the programmer identify and understand the
consequences of the evolution of the document type.

The system can also be used to check basic subtyping: $[\![T_1]\!] \wedge \neg[\![T_2]\!]$. However,
since XPath (and therefore reverse navigation) is not used in that case,
algorithms specialized for this restricted case such as the ones proposed in
[Hosoya and Pierce, 2003] or in [Tozawa and Hagiya, 2003] may perform better
on practical instances.

Additionally, two decision problems are of special interest for XML static 
type checking: 

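• Static type checking of an annotated XPath query: $E^{\rightarrow}[\![e_1]\!](\textcircled{s} \wedge [\![T_1]\!]) \wedge
\neg[\![T_2]\!]$ (if the formula is unsatisfiable then all nodes selected by $e_1$ under
type constraint $T_1$ are included in the type $T_2$.)

• XPath equivalence under type constraints, checked by $E^{\rightarrow}[\![e_1]\!](\textcircled{s} \wedge [\![T_1]\!]) \wedge
\neg E^{\rightarrow}[\![e_2]\!](\textcircled{s} \wedge [\![T_2]\!])$ and $\neg E^{\rightarrow}[\![e_1]\!](\textcircled{s} \wedge [\![T_1]\!]) \wedge E^{\rightarrow}[\![e_2]\!](\textcircled{s} \wedge [\![T_2]\!])$ (This test can
be used to check that the nodes selected after a modification of a type $T_1$
by $T_2$ and an XPath expression $e_1$ by $e_2$ are the same, typically when an
input type changes and the corresponding XPath query has to change as
well.)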

6.6.1 Experimental Results

Extensive tests of the implementation [Genevès, 2006] have been carried out¹.
This section gathers a few of them. All times reported correspond to the actual
running time (in milliseconds) of the satisfiability solver without the extra
(negligible) time spent for parsing XPath and translating into $\mathcal{L}_\mu$.

¹ Experiments have been conducted with a Java implementation running on a Pentium
4, 3 GHz, with 512 MB of RAM with Windows XP.

q1  /site/regions/*/item
q2  /site/auctions/auction/annotation/description/parlist/listitem/text/keyword
q3  //keyword
q4  /descendant-or-self::listitem/descendant-or-self::keyword
q5  /site/regions/*/item[parent::namerica or parent::samerica]
q6  //keyword/ancestor::listitem
q7  //keyword/ancestor-or-self::mail
q8  /site/regions/namerica/item | /site/regions/samerica/item
q9  /site/people/person[address and (phone or homepage)]

Figure 6.6: Queries Taken from the XPathmark Benchmark.

First, an XPath benchmark [Franceschet, 2005] is used. Its goal is to cover
XPath features by gathering a significant variety of XPath expressions met in
real-world applications. In this first test series, types are not yet considered,
and the focus is only given to the XPath containment problem, since its logical
formulation (presented in Section 4.6) is the most complex (as it requires the
logic to be closed under negation). This first test series consists in finding
the relation holding for each pair of queries from the benchmark. This means
checking the containment of each query of the benchmark against all the others.
$q_i \subseteq q_j$ denotes that the query $q_i$ is contained in the query $q_j$. Comparisons of
two queries $q_i$ and $q_j$ may yield three different results:

1. $q_i \subseteq q_j$ and $q_j \subseteq q_i$, the queries are semantically equivalent, which is
denoted by $q_i = q_j$

2. $q_i \subseteq q_j$ but $q_j \not\subseteq q_i$, denoted by $q_i \subset q_j$ or alternatively by $q_j \supset q_i$

3. $q_i \not\subseteq q_j$ and $q_j \not\subseteq q_i$, queries are not related, denoted by $q_i \neq q_j$
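The three outcomes above, together with the mirrored form of the second one, follow mechanically from the two containment tests run for each pair. A minimal sketch, assuming a hypothetical `contained(qi, qj)` oracle that stands for one call to the decision procedure; in the usage example the oracle is replaced by a toy comparison of answer sets on a single document, which is only a stand-in and not a containment proof.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence, Tuple

def relation(i_in_j: bool, j_in_i: bool) -> str:
    """Classify a pair of queries from the two containment answers."""
    if i_in_j and j_in_i:
        return "equivalent"
    if i_in_j:
        return "strictly contained"
    if j_in_i:
        return "strictly contains"
    return "unrelated"

def compare_all(queries: Sequence[str],
                contained: Callable[[str, str], bool]) -> Dict[Tuple[str, str], str]:
    """Run both containment tests for every unordered pair of queries."""
    return {
        (qi, qj): relation(contained(qi, qj), contained(qj, qi))
        for qi, qj in combinations(queries, 2)
    }

# Toy stand-in for the solver: node identifiers selected on one sample document.
ANSWERS = {"q1": {1, 2, 3}, "q5": {1, 2}, "q8": {1, 2}, "q9": {7}}
table = compare_all(list(ANSWERS), lambda a, b: ANSWERS[a] <= ANSWERS[b])
```

With nine queries this makes 36 pairs and 72 containment tests, which corresponds to the two timing columns reported per pair.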

Queries are presented in Figure 6.6 (where "//" is used as a shorthand for
"/descendant-or-self::*/"). Corresponding results together with running times
of the decision procedure are summarized in Table 6.1. Obtained results show
that all tests are solved in at most a few tens of milliseconds. These first results
suggest that several XPath expressions used in real-world scenarios can be efficiently
handled in practice.

As a second test series, several expressions found in research papers on 
the containment of XPath expressions are compared. Figure 6.7 presents the 
collected expressions. Figure 6.7 also shows the obtained results. The first con- 
tainment instance of Figure 6.7 was first formulated in [Miklau and Suciu, 2004] 
as an example for which the proposed tree pattern homomorphism technique is 
incomplete. The third example was not solvable in acceptable time and space 
bounds using the technique based on WS2S presented in Chapter 3. For this 
instance, the technique is orders of magnitude faster, and yields acceptable 
memory footprints. These results suggest that the system is reasonably able to 
handle containment instances which are difficult or impossible to solve using 
other techniques. 

Figure 6.8 presents the results of a third test series including examples 
with intersection, and axes such as "following" and "preceding" , which are not 
illustrated in the previous series. 

Relation    Time (ms)       Relation    Time (ms)
             ⊆     ⊇                     ⊆     ⊇
q1 ≠ q2      17    21       q3 ≠ q7      13    11
q1 ≠ q3      13    20       q3 ≠ q8      16     4
q1 ≠ q4      12    16       q3 ≠ q9      13    16
q1 ⊃ q5      14     9       q4 ≠ q5      22    14
q1 ≠ q6      21    17       q4 ≠ q6       5    12
q1 ≠ q7      13    11       q4 ≠ q7      22    11
q1 ⊃ q8       8    13       q4 ≠ q8      13    17
q1 ≠ q9      14    17       q4 ≠ q9      15    17
q2 ⊂ q3      32    35       q5 ≠ q6      10    10
q2 ⊂ q4      33    38       q5 ≠ q7      13     8
q2 ≠ q5      24    22       q5 = q8       9    14
q2 ≠ q6      21    38       q5 ≠ q9      17    21
q2 ≠ q7      30    31       q6 ≠ q7      21    22
q2 ≠ q8      22    23       q6 ≠ q8      17    17
q2 ≠ q9      35    37       q6 ≠ q9      13    19
q3 ⊃ q4      14    23       q7 ≠ q8      22    19
q3 ≠ q5       7     9       q7 ≠ q9      14    17
q3 ≠ q6       5     8       q8 ≠ q9       9    11

Table 6.1: Results for Comparisons of Benchmark Queries.

e1   /a[.//b[c/*//d]/b[c//d]/b[c/d]]
e2   /a[.//b[c/*//d]/b[c/d]]
e3   a[b]/*/d/*/g
e4   a[b]/(b | c)/d/(e | f)/g
e5   (a[b]/b/d/e/g) | (a/b/d/f/g)
e6   a/b/s//c/b/s/c//d
e7   a//b/*/c//*/d
e8   a[b/e][b/f][c]
e9   a[b/e][b/f]
e10  /descendant::editor[parent::journal]
e11  /descendant-or-self::journal/editor

Relation     Time (ms)
              ⊆     ⊇
e1 ⊂ e2      323   248
e3 ⊃ e4       18    25
e3 ⊃ e5       23    17
e4 ⊃ e5       24    25
e6 ⊂ e7       37    30
e8 ⊂ e9        8     9
e10 = e11     17    14

Figure 6.7: Results for Instances Found in Research Papers.

e12  a/b//c/following-sibling::d/e
e13  a//d[preceding-sibling::c]/e
e14  //a//b//c/following-sibling::d/e
e15  //b[ancestor::a]//*[preceding-sibling::c]/e
e16  /b[preceding::a]//following::c
e17  /a/b//following::c
e18  a/b[//c]/following::d/e
e19  a//d[preceding::c]/e
e20  a/b//d[preceding-sibling::c]/e
e21  a/c/following::d/e
e22  a/d[preceding::c]/e
e23  a/b[//c]/following::d/e ∩ a/d[preceding::c]/e
e24  a/c/following::d/e ∩ a/d[preceding::c]/e

Relation     Time (ms)
              ⊆     ⊇
e12 ⊂ e13     23    17
e14 ⊂ e15     12    23
e16 ⊂ e17     18    22
e18 ⊂ e19     17    15
e20 = e12     23    24
e21 ≠ e22     15    19
e23 ⊂ e21     22    19
e24 ≠ e18     16    11

Figure 6.8: Results for Instances with Horizontal Navigation.

In the fourth test series, several XPath expressions (shown in Figure 6.9)
are used in the presence of two real-world XML types: the DTDs of the SMIL
[Hoschka, 1998] and XHTML [Pemberton, 2000] W3C recommendations. Table 6.2
gives the size of each DTD by presenting the number of symbols used
(alphabet size) and the number of grammar production rules (type variables)
in the unranked and binary representations. Several decision problems and
their results are presented in Table 6.3. For example, the emptiness test for
p9 shows that the official XHTML DTD does not syntactically prohibit the
nesting of anchors. Obtained results suggest that deciding XPath problems
remains practically feasible, especially for static analysis purposes where such
operations are performed at compile-time.

p5   switch/layout
p6   smil/head//layout
p7   smil/head//layout[ancestor::switch]
p8   *//switch[ancestor::head]/descendant::seq//audio[preceding-sibling::video]
p9   descendant::a[ancestor::a]
p10  /descendant::*
p11  html/(head | body)
p12  html/head/descendant::*
p13  html/body/descendant::*
p14  //img
p15  //img[not *]

Figure 6.9: Queries Used in the Presence of DTDs.

DTD                            Labels   Tree Type Variables
SMIL 1.0 [Hoschka, 1998]       19       29 unranked, 11 binary
XHTML 1.0 [Pemberton, 2000]    77       104 unranked, 325 binary

Table 6.2: Types Used in Experiments.

Question        Instance                  DTD     Answer   Time (ms)
Non-Emptiness   p5                        SMIL    yes      56
Overlap                                   SMIL    no       75
Containment                               SMIL    no       81
Non-Emptiness   p8                        SMIL    yes      94
Non-Emptiness   p9                        XHTML   yes      2530
Coverage        p10 ⊆ p11 ∪ p12 ∪ p13     XHTML   yes      2723
Containment     p14 ⊆ p15                 XHTML   yes      2937

Table 6.3: Results in the Presence of DTDs.

An additional benefit of the technique is that it automatically outputs a
satisfying XML document, which can easily be enriched with XPath context
and target information. For instance, the solver trace for the emptiness test
for p8 is given below:

Checking emptiness of
*//switch[ancestor::head]/descendant::seq//audio[preceding-sibling::video]
in the presence of 'smil.dtd'.
Parsing XPath [249 ms].
Compilation of XPath to Tree Logic Formulas [15 ms].
Input DTD read from 'sampleDTDs/smil.dtd'.
Start symbol is $smil
Converted DTD into BTT [140 ms].
CFT: 29 type variables and 19 terminals.
BTT: 11 type variables and 17 terminals.
Translated BTT into Tree Logic [16 ms].
Computing Relevant Closure
Computed Relevant Closure [46 ms].
Computed Lean [0 ms].
The Lean has size 53. It contains 35 eventualities and 18 symbols.
Fixpoint Computation Initialized [31 ms].
Computing Fixpoint [94 ms].
Formula is satisfiable [171 ms].
A satisfying finite binary tree model was found [94 ms]:
smil(head(switch(seq(video(#, audio), layout), meta), #), #)
In XML syntax:
<smil context="true">
  <head>
    <switch>
      <seq>
        <video/>
        <audio target="true"/>
      </seq>
      <layout/>
    </switch>
    <meta/>
  </head>
</smil>

*//switch[ancestor::head]/descendant::seq//audio[preceding-sibling::video]
is satisfiable in presence of 'smil.dtd'
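The two renderings of the model in the trace are related by the usual first-child/next-sibling encoding: in a binary term `l(c, s)`, `c` is the first child of `l`, `s` is its next sibling, and `#` marks an absent subtree. The following sketch converts such a term into unranked XML. The parser, its accepted syntax and the function names are assumptions inferred from the single term shown in the trace; the `context` and `target` annotations are not part of the term and are therefore not reproduced.

```python
import re
from typing import List, Optional, Tuple

BinNode = Optional[tuple]          # (label, first_child, next_sibling), None for '#'
Tree = Tuple[str, list]            # (label, list of child trees)

def parse_binary(term: str) -> BinNode:
    """Parse a term such as 'a(b(#, c), #)' into nested (label, child, sibling)."""
    tokens = re.findall(r"[A-Za-z_][\w.-]*|[#(),]", term)
    pos = 0

    def expect(symbol: str) -> None:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != symbol:
            raise ValueError(f"expected {symbol!r} at token {pos}")
        pos += 1

    def node() -> BinNode:
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of term")
        tok = tokens[pos]
        pos += 1
        if tok == "#":
            return None
        if tok in ("(", ")", ","):
            raise ValueError(f"unexpected {tok!r} at token {pos - 1}")
        child = sibling = None     # a bare label has neither child nor sibling
        if pos < len(tokens) and tokens[pos] == "(":
            pos += 1
            child = node()
            expect(",")
            sibling = node()
            expect(")")
        return (tok, child, sibling)

    root = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the term")
    return root

def to_forest(node: BinNode) -> List[Tree]:
    """Follow next-sibling links to rebuild the list of unranked trees."""
    forest: List[Tree] = []
    while node is not None:
        label, child, sibling = node
        forest.append((label, to_forest(child)))
        node = sibling
    return forest

def render(forest: List[Tree], depth: int = 0) -> List[str]:
    """Pretty-print a forest as indented XML lines."""
    lines: List[str] = []
    pad = "  " * depth
    for label, children in forest:
        if children:
            lines.append(f"{pad}<{label}>")
            lines.extend(render(children, depth + 1))
            lines.append(f"{pad}</{label}>")
        else:
            lines.append(f"{pad}<{label}/>")
    return lines

MODEL = "smil(head(switch(seq(video(#, audio), layout), meta), #), #)"
forest = to_forest(parse_binary(MODEL))
xml_lines = render(forest)
```

Applied to the term from the trace, `to_forest` places `layout` as a sibling of `seq` and `meta` as a sibling of `switch`, which matches the nesting of the XML document printed by the solver.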

These experiments shed a first light on the cost of solving XML decision 
problems in practice, and suggest that the presented system is already able to 
handle realistic scenarios. 

6.7 Outcome 

The essence of the obtained results lives in a sub-logic of the alternation-free
modal $\mu$-calculus with converse, with some syntactic restrictions on formulas,
and where models are finite trees. As detailed in Chapter 5, the syntactic
restrictions make it possible to prove that formulas of the logic are cycle-free. The
cycle-free property is used to prove that the least and greatest fixpoint operators
collapse in a single fixpoint operator. This provides closure under negation,
which is the key property for solving the containment (a logical implication).
Deep connections between this logic and XML decision problems can then be
revealed: XPath expressions and regular tree type formulas conform to the
$\mathcal{L}_\mu$ syntactic restrictions. Furthermore, XPath expressions and regular tree
languages can be embedded efficiently, since the corresponding formulas in the
logic have a size linear in that of the original expressions.

A sound and complete algorithm for testing the satisfiability of the logic is
described in this chapter. Its upper bound time complexity is $2^{O(n)}$ w.r.t. the
length $n$ of the given formula. The combination of all these ingredients yields
the main result: sound and complete decision procedures, with the same upper
bound complexity, for XML decision problems involving regular tree types and
XPath queries. The implementation appears efficient in practice. A benefit of
the approach is that the system can be effectively used in static analyzers for
programming languages manipulating both XPath expressions and XML type
annotations (input and output).

7.1 Summary of the Main Contributions

The main contribution of this thesis is a new logic for finite trees, derived
from the $\mu$-calculus. The logic is expressive enough to capture regular tree
types along with multi-directional navigation in finite trees. It is decidable in
single exponential time (specifically in $2^{O(n)}$ steps where $n$ is the size of the
input formula defined as its number of atomic propositions and eventualities).
This improves the best known computational complexity for finite trees. As
such, this logic offers a new compromise between expressivity and complexity,
specifically interesting in the context of XML.

Another contribution of this thesis is to show how to linearly compile queries 
and regular tree types (including DTDs and XML Schemas) in the logic. The 
logic takes almost full XPath into account and supports the largest fragment 
that has been treated for static analysis. Another advantage is that the logic is a 
sublogic of an existing one (the $\mu$-calculus), thus facilitating known optimization
techniques and warranting extensibility. 

This solves the major decision problems needed in the static analysis of 
XML specifications. These problems involve containment, emptiness, equiva- 
lence, overlap, and coverage of XPath queries (in the presence or absence of 
regular tree types), static type-checking of an annotated XPath query, and 
XPath equivalence under type constraints. 

Furthermore, implementation techniques that yield concrete design and ef- 
fective algorithm implementation in practice are presented. The fully imple- 
mented system is already able to handle realistic scenarios. 

7.2 Perspectives 

There are a number of interesting and promising directions for further research 
that builds on the results and ideas developed in this dissertation. 

7.2.1 Further Optimizations of the Logical Solver 

The worst-case complexity upper bound for deciding $\mathcal{L}_\mu$ cannot be less than
exponential time (since it can be used to decide FTA containment, or alternatively
since it contains the CTL subsystem). Nevertheless, several techniques
can be further developed for continuing to improve the performance of the
implementation. One perspective is to use dynamic reordering of BDD variables
whenever it can speed up the decision procedure. Another interesting
direction of further research is to attempt to statically reduce Lean contents
by exploiting peculiarities of particular problem instances such as locality.

7.2.2 Pushing the XPath Decidability Envelope Further 

One perspective of this thesis consists in extending the considered XPath
fragment in order to support restricted data value comparisons (in the manner
of [Bojanczyk et al., 2006]). Another direction for extending the fragment
consists in integrating related work on counting [Dal-Zilio et al., 2004,
Seidl et al., 2004] to the logic.

7.2.3 Enhancing the Translation of Regular Tree Types 

Another perspective consists in considering a modification of the translation of 
types such that it imposes the context of a type to also follow the regular tree 
language definition (stating for instance that the parent of a given node may 
only be some specific other nodes). This would allow a yet more precise and 
interesting reporting on type-checking instances. 

7.2.4 Efficiently Supporting Attributes and Data Values 

Most theoretical work on XML and XPath models XML documents by fi- 
nite labeled ordered trees, where the labels are taken from a finite alphabet. 
Attributes and data values are usually ignored. This thesis makes the same ab- 
stractions, and thus still offers perspectives for supporting more XML features. 
There is a reason for each of these widespread abstractions.

The difficulty in supporting XML attributes arises from the fact that they
are unordered [et al., 2004], which forces one to consider mixed ordered and
unordered contents in the underlying data model. There are several directions
that can be followed for supporting constraints over mixed content while avoiding
blow-ups caused by a naive modeling of unordered data on top of an ordered
data model. Shuffle automata introduced in the 1970s provide a way to deal
with an interleave operator [Jedrzejowicz and Szepietowski, 2001]. The work
found in [Dal-Zilio and Lugiez, 2006] introduces the Sheaves logic and a related
new class of automata (sheaves automata) suited for ordered trees. The logic
combines regularity and counting constraints, and provides an interleaving
operator. The work found in [Murata and Hosoya, 2003] proposes an automata
rewriting technique for handling attribute-element constraints, which has been
implemented in a validator for RELAX NG. The approach presented in this
dissertation can easily be extended for supporting unordered XML attributes
in an alternative manner, by observing that the algorithm proposed in Chapter 6
is based on $\psi$-types. Since a $\psi$-type is simply a set of formulas, attributes
could naturally be modeled by a new class of atomic propositions, with the
same complexity.

The usual reason for ignoring data values is that they quickly lead to
languages whose static analysis is undecidable [Alon et al., 2003,
Neven and Schwentick, 2003, Benedikt et al., 2005]. Nevertheless, there exist
examples of decidable static reasoning tasks involving attribute values
[Arenas et al., 2005, Buneman et al., 2003, Bojanczyk et al., 2006]. A
perspective of this thesis is to extend the algorithm proposed in Chapter 6 to
deal with attribute values. This could help identify more precisely the
upper-bound complexity of decision problems involving XPath with limited data
value comparison, which has been observed to be between NEXPTIME and
3-NEXPTIME in the recent work of [Bojanczyk et al., 2006].

7.2.5 Query Optimization 

Another perspective of this thesis is to take advantage of the static analysis
of XPath expressions for optimization purposes. This makes it possible, for
instance, to automatically detect contradictions and eliminate redundancies
from XML queries at compile time, as preliminarily investigated in
[Genevès and Vion-Dury, 2004]. One perspective is to extend this work with a
trace-based semantics for XPath (in the manner of [Hartel, 2005]) in order to
capture the optimality of a query w.r.t. a given evaluation context. The
optimal query could then be computed by using the automatic comparison of
queries described in this thesis.

7.2.6 Query Evaluation via Model-Checking 

The linear translation of XPath into the μ-calculus opens perspectives for
query evaluation. A direction of future work consists in revisiting XPath
evaluation (reduced to model-checking) based on existing techniques
[Mateescu, 2002, Mateescu and Sighireanu, 2003].

7.2.7 Application to the Static Analysis of Transformations 

Last but not least, a perspective of this thesis is to apply the presented
XPath static analysis techniques to the type-checking of XML transformation
languages. Results presented in this dissertation open the way to the
construction of debuggers, compilers, and type-checkers for XSLT and XQuery.
For example, the recent work of [Møller et al., 2005] could benefit from using
the exact algorithm of Chapter 6 instead of its conservative approximation.
The practical experiments reported in Chapter 6 strengthen the hope for an
effective analysis of this kind in the near future.
Appendix

Computational Complexity for Logical Satisfiability Dealt With in this
Dissertation

Undecidable

Decidable
    WS2S [Meyer, 1975], used in Chapter 3 (non-elementary: a tower of
    exponentials of height $O(n)$).

Elementary

EXPSPACE

EXPTIME
    $2^{O(n \log n)}$: μ-calculus [Grädel et al., 2002].
    $2^{O(n \log n)}$: AFMC [Tanabe et al., 2005], used in Chapter 4.
    $2^{O(n)}$: $L_\mu$ logic proposed in Chapters 5 and 6.

PSPACE

NP

P (PTIME)
Motivation and Objectives

This work was initially motivated by the need for efficient static analyzers
for languages that manipulate XML data and documents. These programming
languages use schemas [Fallside and Walmsley, 2004] and XPath queries
[Clark and DeRose, 1999] as first-class constructs. Current examples of such
languages include the W3C recommendation XSLT [Clark, 1999] for transforming
XML documents, and the forthcoming recommendation XQuery [Boag et al., 2006]
for querying XML databases. Equipping these languages with decidable and
efficient systems for static type-checking has been one of the major research
challenges of the last decade, bringing together, among others, the
communities working on programming languages, databases, structured
documents, and theoretical computer science. This work continues the research
effort initiated in [Murata, 1996, Tozawa, 2001, Milo et al., 2003,
Hosoya and Pierce, 2003].

This work led to the design of a finite tree logic tailored to XML, together
with its decision procedure, both presented in this thesis. The logical solver
has been implemented at the core of a system for the general static analysis
and typing of XML specifications. The system can be used as a component of
static analyzers for programming languages that use both XPath expressions
and XML types.

This thesis presents the theoretical investigations that led to the
foundations of this new finite tree logic, along with the algorithmic basis
and the implementation principles on which the logical solver relies. These
findings are applied to solving XML typing problems, which are translated
into the logic. The problems solved include the static typing of the XPath
language in the presence of regular tree types.

XML Documents and Schemas

Extensible Markup Language (XML) [Bray et al., 2004] is a text file format for
representing tree structures in a standard form.

The complete structure of an XML document, abstracting away details of lesser
importance, is a tree of variable arity in which nodes (also called elements
in XML jargon) are labeled, the leaves of the tree are text nodes, and the
order of child nodes matters. XML can be seen as a concrete syntax for
describing such structures using tags. An example XML document follows:

<plante>
  <categorie>Vasculaire</categorie>
  <tissu>
    <nom>Phloème</nom>
    <def>Le phloème est un tissu vivant servant au transport
         dans toutes les parties de la plante.</def>
    <note>Dans les arbres, c'est une partie de l'écorce.</note>
  </tissu>
</plante>

An element is described by a pair consisting of an opening tag <...> and a
closing tag </...>, between which the content of the element is inserted. In
the previous example, "plante", "categorie", "tissu", "nom", "def", and "note"
are labels (element names in XML jargon).

The XML specification does not define a priori the set of labels allowed in an
XML document, nor does it define any semantics for labels. Only
well-formedness conditions are defined, to ensure that elements are properly
nested, which allows XML documents to be considered as trees. For example,
Figure 1 gives a more visual representation of the previous well-formed XML
document.




[Figure 1: Example: tree of a well-formed document. Text leaves shown:
"Phloème", "Le (...)", "Dans (...)".]

The set of labels that may appear in an XML document is determined by schemas,
which can be freely defined by users. A schema (also called an XML type) is a
description of constraints on the structure of documents, such as the allowed
labels and their possible nesting structure. A schema thus defines a class of
XML documents. Two levels of correctness can therefore be distinguished for
XML documents:

• well-formedness, which applies to documents that satisfy the necessary and
  sufficient condition (defined by the XML specification) for being
  interpreted as trees;

• validity, which applies to documents that satisfy the additional
  constraints described by a given schema.

The validity of a document implies its well-formedness, since a schema
describes constraints on the tree and not on the textual representation of
the XML document.
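The two levels of correctness can be illustrated with a minimal Python sketch. Well-formedness is delegated to the standard library parser; validity is checked against a hypothetical schema, written here as a dictionary (`SCHEMA`) that maps each element name to a regular expression over the names of its children. The schema, the helper names, and the sample documents are assumptions made for illustration only; they are not a DTD or XML Schema processor, and they ignore text content.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical schema: element name -> regular expression over the sequence
# of child element names (each name is followed by one space).
SCHEMA = {
    "plante": r"categorie (tissu )+",
    "categorie": r"",
    "tissu": r"nom def (note )?",
    "nom": r"",
    "def": r"",
    "note": r"",
}
ROOT = "plante"


def is_well_formed(text):
    """True if the text parses as an XML tree."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def is_valid(text, schema=SCHEMA, root_name=ROOT):
    """True if the text is well-formed and its tree satisfies the schema."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False  # not a tree at all, hence not valid either

    def check(node):
        model = schema.get(node.tag)
        if model is None:
            return False  # label not allowed by the schema
        children = "".join(child.tag + " " for child in node)
        if re.fullmatch(model, children) is None:
            return False  # children do not match the content model
        return all(check(child) for child in node)

    return root.tag == root_name and check(root)


DOC = (
    "<plante><categorie>Vasculaire</categorie>"
    "<tissu><nom>Phloeme</nom><def>Tissu vivant.</def>"
    "<note>Partie de l'ecorce.</note></tissu></plante>"
)
NOT_NESTED = "<plante><tissu></plante></tissu>"  # tags are not properly nested
NO_CATEGORY = "<plante><tissu><nom>Phloeme</nom><def>Tissu vivant.</def></tissu></plante>"

for name, text in [("DOC", DOC), ("NOT_NESTED", NOT_NESTED), ("NO_CATEGORY", NO_CATEGORY)]:
    print(name, is_well_formed(text), is_valid(text))
```

The third document is a tree, so generic tools can process it, yet it is rejected by the schema because the required "categorie" child is missing.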

Each application can define its own data format by defining schemas, at a
higher level of abstraction (tree structures). For this reason, XML is often
called a metalanguage, or a "format for data formats".

Separating the two levels of correctness allows applications to share generic
software tools for manipulating well-formed documents (parsers, editors,
query and transformation tools...). These tools all implement the same
conventions defined by the XML specification (such as how to include
comments, external fragments, special characters...). XML thus makes a first
level of processing possible for an XML document as soon as it is well-formed,
without making the much stronger additional assumption that it is valid with
respect to some schema. This genericity is one of the strengths of XML. As a
consequence, XML has been adopted at an unprecedented speed and scale. Many
schemas have been defined and are now widely used in practice, for example:
XHTML (the XML version of HTML), SVG (for vector graphics), SMIL (for the
synchronization of multimedia documents), MathML (for mathematical formulas),
SOAP (for remote procedure calls), XBRL and FIX (for financial information
and securities transactions), SMD (for music), X3D (for 3D modeling), and CML
(for chemical structures).

XPath

XPath [Clark and DeRose, 1999, Berglund et al., 2006] was introduced by the
W3C as the standard query language for selecting and retrieving information
in XML documents. It allows navigating XML trees and returning a set of nodes
satisfying certain conditions. As such, XPath forms the essence of XML data
access.

In their simplest form, XPath expressions look like "directory navigation
paths". For example, the XPath expression

/livre/chapitre/section

navigates from the root of a document (denoted by the leading "/") through
the "livre" nodes at the first level, to their "chapitre" child nodes, and on
to their child nodes named "section". The result of evaluating the whole
expression is the set of all "section" nodes that can be reached in this
manner. Moreover, at each navigation step, the selected nodes can be filtered
using qualifiers. A qualifier is a boolean expression between square brackets
that can test the existence or absence of paths. For example, if we ask the
following query:

/livre/chapitre/section[citation]

then the result consists of all "section" elements that have at least one
child element named "citation". The situation becomes more interesting when
XPath's ability to navigate along "axes" other than the "child" axis is used.
Indeed, the previous XPath expression is a shorthand for:

/child::livre/child::chapitre/child::section[child::citation]

which makes explicit that each navigation step uses the "child" axis,
containing all the child nodes of the nodes selected at the previous step. If
we ask the following query:

/child::livre/descendant::*[child::citation]

then the last step selects nodes of any name that are among the descendants
of the "livre" element and that have a sub-element named "citation". It is
also possible to use axes such as "preceding-sibling" to navigate to the
preceding children of the same parent, or "ancestor" to navigate recursively
upward (see Figure 2). Document order is defined as the order in which nodes
are visited by a depth-first traversal of the tree. Axes that navigate in the
reverse of document order are called "reverse axes".
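The three queries discussed above can be tried out with the limited XPath subset of Python's standard `xml.etree.ElementTree` module. This is only an illustrative sketch: the sample document is invented, and because this module evaluates paths relative to an element, the absolute path `/livre/...` is written as a relative path starting from the `livre` root element.

```python
import xml.etree.ElementTree as ET

# Invented sample document; the "id" attributes only serve to name the results.
DOC = """
<livre>
  <chapitre>
    <section id="s1"><citation/></section>
    <section id="s2"/>
  </chapitre>
  <chapitre>
    <section id="s3"><para><citation/></para></section>
  </chapitre>
</livre>
"""

livre = ET.fromstring(DOC)

# /livre/chapitre/section
sections = [s.get("id") for s in livre.findall("./chapitre/section")]

# /livre/chapitre/section[citation]
with_citation = [s.get("id") for s in livre.findall("./chapitre/section[citation]")]

# /child::livre/descendant::*[child::citation]
any_with_citation = [e.tag for e in livre.findall(".//*[citation]")]

print(sections)           # all three sections, in document order
print(with_citation)      # only the section with a "citation" child
print(any_with_citation)  # any descendant with a "citation" child
```

Section "s3" is not selected by the second query because its "citation" is a grandchild, whereas the third query selects the intermediate "para" element.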

The previous examples all illustrate absolute XPath expressions, since they
all begin with a "/" that refers to the root. The semantics of a relative
expression (without the leading "/") is defined with respect to a context
node in the tree. The context node simply denotes the node of the tree from
which navigation starts. Starting from any context node in a tree, all other
nodes can easily be reached: the XPath axes form a partition of the tree.
Figure 2 illustrates this on an example. More informal details on the
complete XPath language can be found in the W3C specification
[Clark and DeRose, 1999].
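The partition property can be checked on a small tree. The sketch below computes, for a context node, the five axes that the XPath 1.0 specification states partition a document (self, ancestor, descendant, preceding, following; attribute and namespace nodes are ignored here). The function name `axes_partition` and the sample tree are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def axes_partition(root, context):
    """Split the elements of the tree into five axes relative to `context`."""
    order = list(root.iter())  # document order (depth-first, pre-order)
    parent = {child: node for node in order for child in node}

    ancestors = []
    node = context
    while node in parent:      # climb until the root is reached
        node = parent[node]
        ancestors.append(node)

    descendants = [n for n in context.iter() if n is not context]
    position = order.index(context)
    # Nodes before the context in document order, except its ancestors.
    preceding = [n for n in order[:position] if n not in ancestors]
    # Nodes after the context in document order, except its descendants.
    following = [n for n in order[position + 1:] if n not in descendants]

    return {
        "self": [context],
        "ancestor": ancestors,
        "descendant": descendants,
        "preceding": preceding,
        "following": following,
    }


root = ET.fromstring("<a><b><c/><d><e/></d><f/></b><g/></a>")
context = root.find("./b/d")
tags = {axis: [n.tag for n in nodes] for axis, nodes in axes_partition(root, context).items()}
print(tags)
```

Every element of the tree falls in exactly one of the five lists; axes such as "child", "parent", or "preceding-sibling" select subsets of these.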

XPath is increasingly popular owing to its expressive power and compact
syntax. These two advantages have given XPath a central role in other key XML
specifications and applications. It is used in XQuery [Boag et al., 2006] as
the core language for formulating queries; in XSLT [Clark, 1999] for selecting
nodes in transformations; in XML Schema [Fallside and Walmsley, 2004] for
defining keys; in XLink [DeRose et al., 2001] and XPointer
[DeRose et al., 2002] for referencing parts of XML data. XPath is also used in
many applications such as update languages [Sur et al., 2004] and access
control [Fan et al., 2004].

[Figure 2: Partition of the axes from a context node. Axis labels shown: self,
parent, child, preceding-sibling, following-sibling, descendant.]

Static Type-Checking

XML applications mostly use schemas to perform validation (also called
dynamic type-checking). Validation consists in using a schema validator that
analyzes a particular XML document against a given schema in order to make
sure that the document conforms to the expectations of the application.

In practice, however, XML documents are often generated dynamically by some
program. Typically, programs that manipulate XML first access data (possibly
conforming to some schema) using XPath expressions, and then build and return
a result XML document that conforms to the requirements of another schema.

An ambitious approach is static type-checking for such programs, which
consists in ensuring at compile time that the code processing XML data cannot
produce an invalid document. A static type-checker analyzes a program,
possibly in the presence of the schemas describing its inputs and outputs (if
these happen to be available). The difficulty of the problem depends on the
languages in which the program and the schemas are expressed.

Schema languages have been studied extensively and are now well understood as
subsets of regular tree languages [Murata et al., 2005]. However, although
many attempts have been made to better understand static typing techniques,
in particular through the design of domain-specific programming languages
[Hosoya and Pierce, 2003], no approach is actually able to support XPath,
which nevertheless remains the essence of navigation and access to XML data.



Research Challenges

The limitations of existing approaches are explained by the difficulty of the
static analysis of XPath. The static analysis of the complete XPath language
is known to be undecidable. The importance and breadth of applications
nevertheless motivate research questions: what is the largest fragment of
XPath whose static analysis is decidable? Which fragments can be decided
efficiently in practice? How can one determine whether an XPath expression is
satisfiable on one of the XML trees defined by a given schema? How can one
know whether two queries will always yield the same result when evaluated on
a document that is valid with respect to some schema? Does the result of an
XPath expression on a valid document always conform to the requirements of
another schema? Is there an algorithm able to answer these questions
efficiently enough to be usable in practice?

One source of difficulty for such an algorithm is that it must check
properties under a quantification over a possibly infinite set of trees. A
variety of other factors further contribute to its complexity, such as the
operators allowed in XPath queries and their composition (see Section 2.2).
As a consequence of these difficulties, such research questions are still
open.

Overview of this Thesis

This thesis starts from the idea that two issues must be addressed in order to
solve decision problems in the XML world. First, identifying an appropriate
logic with sufficient expressive power to support both regular tree languages
and XPath-style navigation and node selection semantics. Second, efficiently
solving the satisfiability problem for this logic, which determines whether a
given formula of the logic admits an XML document satisfying it.

Main Contributions

The main contribution of this thesis is a new logic for finite trees, derived
from the μ-calculus. The logic is expressive enough to capture regular tree
languages and multi-directional navigation in finite trees. It is decidable in
single exponential time (more precisely in $2^{O(n)}$ steps, where $n$ is the
size of the formula whose truth status is being determined, defined as the
number of atomic propositions and eventualities it contains). This improves
on the best known computational complexity for finite trees. As such, this
logic offers a new trade-off between expressive power and complexity, of
particular interest in the context of XML.

Another contribution of this thesis is to show how to linearly translate
queries and regular tree types (including DTDs and XML Schemas) into the
logic. The logic takes almost all of XPath into account, and supports the
largest fragment that has been treated for static analysis. A further
advantage is that the logic is a sublogic of an existing one (the
μ-calculus), which makes it easier to apply known optimization techniques and
to extend it.

This solves the major decision problems encountered in the static analysis of
languages manipulating XML structures. These problems include containment,
satisfiability, equivalence, overlap, and coverage of XPath queries (in the
presence or absence of regular tree types), the static typing of an annotated
XPath query, and the equivalence of queries under type constraints.

In addition, implementation techniques are presented, which lead to the
concrete realization and implementation of algorithms that are efficient in
practice. The fully implemented system is already able to handle realistic
scenarios.



Applications

The main application of this work is a new class of static analyzers for
programs manipulating XML data and documents. This class of analyzers
directly uses the results described in this thesis, which solve the decision
problems they face. Solving each particular problem has important
applications.

For example, the most fundamental problem for a query language is
satisfiability: does an expression always return an empty result? XPath
satisfiability is important for optimizing the host languages of XPath: for
instance, if it can be determined at compile time that a query is
unsatisfiable, then all computations that depend on it can be avoided.

Another fundamental problem is equivalence: do two queries always return the
same results? Being able to decide equivalence between two queries is useful
for reformulating and optimizing the query itself
[Genevès and Vion-Dury, 2004], with the aim of ensuring operational
properties while preserving the semantics of the query
[Abiteboul and Vianu, 1999, Levin and Pierce, 2005].

The most critical problem for the static typing of XML transformations is
XPath query containment: for every tree, is the result of a given query
included in the result of another? This problem arises in the control-flow
analysis of XSLT [Møller et al., 2005]. Being able to decide containment is
also useful for checking integrity constraints [Fallside and Walmsley, 2004],
and for checking access policies and rights in XML security applications
[Fan et al., 2004].

Other decision problems useful in applications include, for example, query
overlap (can two expressions select common nodes?) and coverage (are the
nodes selected by a query always contained in the union of the results
selected by other queries?). These problems are useful, for instance, for
statically detecting programming errors.
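The decision problems listed above can all be phrased as satisfiability tests. The following summary is a sketch under two assumptions: each query $e$ is translated into a formula $\varphi_e$ that holds exactly at the nodes selected by $e$, and the logic is closed under negation. The notation $\varphi_e$ is illustrative and is not necessarily the notation used in the chapters of this thesis; type constraints are left out.

```latex
\begin{align*}
\text{emptiness of } e &: \varphi_e \text{ is unsatisfiable} \\
\text{containment } e_1 \subseteq e_2 &: \varphi_{e_1} \wedge \neg\varphi_{e_2} \text{ is unsatisfiable} \\
\text{equivalence } e_1 \equiv e_2 &: e_1 \subseteq e_2 \text{ and } e_2 \subseteq e_1 \\
\text{overlap of } e_1 \text{ and } e_2 &: \varphi_{e_1} \wedge \varphi_{e_2} \text{ is satisfiable} \\
\text{coverage of } e \text{ by } e_1, \dots, e_k &: \varphi_e \wedge \bigwedge_{i=1}^{k} \neg\varphi_{e_i} \text{ is unsatisfiable}
\end{align*}
```

In each case a satisfying tree, when one exists, serves as a concrete witness: for containment, for instance, it is a document with a node selected by $e_1$ but not by $e_2$.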

This thesis solves these decision problems, in the presence or absence of XML
type constraints such as DTDs [Bray et al., 2004] or XML Schemas
[Fallside and Walmsley, 2004]. This makes it possible to ensure important
local or global properties (such as well-typedness or optimizations) at
compile time, for safer and more efficient processing of XML data. In
particular, the results presented in this thesis open promising perspectives
for the static analysis of XML transformations.
Organization of the Thesis

This thesis is divided into three parts. The first part is devoted to the
state of the art and presents existing state-of-the-art techniques and
related research work. To this end, Chapter 2 introduces some theoretical
foundations and formalisms used in the remainder of this thesis, while
progressively introducing related work as the underlying concepts are
presented.

In a second part, Chapters 3 and 4 conduct preliminary investigations with
known logics in the XML setting. More precisely, Chapter 3 studies to what
extent monadic second-order logic can be used in practice, despite its high
complexity, to solve static analysis problems such as XPath query
containment. A sound decision procedure for containment is proposed.
Chapter 4 introduces the alternation-free μ-calculus as a powerful
replacement for monadic second-order logic, and studies its use for reasoning
about XML trees. Decision problems involving XPath queries and regular types
are reduced to satisfiability of this logic over general Kripke structures.

Building on the lessons learned from these investigations, the third part of
this thesis presents the final contribution. Chapter 5 proposes a finite tree
logic specifically designed for XML. Chapter 6 proposes an algorithm for
testing satisfiability of the logic, together with techniques for its
implementation. Experiments are carried out with a complete implementation of
the system, which proves efficient on several realistic scenarios. Finally,
Chapter 7 concludes this thesis and gives new perspectives.